1. Introduction
Recommender systems (RS) aim to address the information explosion and meet users' personalized interests. One of the most popular recommendation techniques is collaborative filtering (CF) (Koren et al., 2009), which utilizes users' historical interactions and makes recommendations based on their common preferences. However, CF-based methods usually suffer from the sparsity of user-item interactions and the cold start problem. Therefore, researchers propose using side information in recommender systems, including social networks (Jamali and Ester, 2010), attributes (Wang et al., 2018b), and multimedia (e.g., texts (Wang et al., 2015), images (Zhang et al., 2016)). Knowledge graphs (KGs) are one type of side information for RS, which usually contain fruitful facts and connections about items. Recently, researchers have proposed several academic and commercial KGs, such as NELL (http://rtw.ml.cmu.edu/rtw/), DBpedia (http://wiki.dbpedia.org/), Google Knowledge Graph (https://developers.google.com/knowledge-graph/), and Microsoft Satori (https://searchengineland.com/library/bing/bing-satori). Due to its high dimensionality and heterogeneity, a KG is usually preprocessed by knowledge graph embedding (KGE) methods (Wang et al., 2018a), which embed entities and relations into low-dimensional vector spaces while preserving the KG's inherent structure.
Existing KG-aware methods
Inspired by the success of applying KGs in a wide variety of tasks, researchers have recently tried to utilize KGs to improve the performance of recommender systems (Yu et al., 2014; Zhao et al., 2017; Wang et al., 2018d; Wang et al., 2018c; Zhang et al., 2016). Personalized Entity Recommendation (PER) (Yu et al., 2014) and Factorization Machine with Group lasso (FMG) (Zhao et al., 2017) treat the KG as a heterogeneous information network, and extract meta-path/meta-graph based latent features to represent the connectivity between users and items along different types of relation paths/graphs. It should be noted that PER and FMG rely heavily on manually designed meta-paths/meta-graphs, which limits their application in generic recommendation scenarios. Deep Knowledge-aware Network (DKN) (Wang et al., 2018d) designs a CNN framework to combine entity embeddings with word embeddings for news recommendation. However, the entity embeddings are required in advance of using DKN, causing DKN to lack an end-to-end way of training. Another concern about DKN is that it can hardly incorporate side information other than texts. RippleNet (Wang et al., 2018c) is a memory-network-like model that propagates users' potential preferences in the KG and explores their hierarchical interests. But the importance of relations is weakly characterized in RippleNet, because the embedding matrix $\mathbf{R}$ of a relation can hardly be trained to capture the sense of importance in the quadratic form $\mathbf{h}^\top \mathbf{R} \mathbf{t}$ ($\mathbf{h}$ and $\mathbf{t}$ are embedding vectors of two entities). Collaborative Knowledge base Embedding (CKE) (Zhang et al., 2016) combines CF with structural knowledge, textual knowledge, and visual knowledge in a unified framework. However, the KGE module in CKE (i.e., TransR (Lin et al., 2015)) is more suitable for in-graph applications (such as KG completion and link prediction) rather than recommendation. In addition, the CF module and the KGE module are loosely coupled in CKE under a Bayesian framework, making the supervision from the KG less obvious for recommender systems.
The proposed approach
To address the limitations of previous work, we propose MKR, a multi-task learning (MTL) approach for knowledge graph enhanced recommendation. MKR is a generic, end-to-end deep recommendation framework, which aims to utilize the KGE task to assist the recommendation task (the KGE task can also benefit from the recommendation task empirically, as shown in the experiments section). Note that the two tasks are not mutually independent, but are highly correlated since an item in RS may be associated with one or more entities in the KG. Therefore, an item and its corresponding entity are likely to have a similar proximity structure in RS and the KG, and share similar features in low-level and non-task-specific latent feature spaces (Long et al., 2017). We will further validate this similarity in the experiments section. To model the shared features between items and entities, we design a cross-compress unit in MKR. The cross-compress unit explicitly models high-order interactions between item and entity features, and automatically controls the cross knowledge transfer for both tasks. Through cross-compress units, representations of items and entities can complement each other, assisting both tasks in avoiding fitting noise and improving generalization. The whole framework can be trained by alternately optimizing the two tasks with different frequencies, which endows MKR with high flexibility and adaptability in real recommendation scenarios.
We probe the expressive capability of MKR and show, through theoretical analysis, that the cross-compress unit is capable of approximating sufficiently high order feature interactions between items and entities. We also show that MKR is a generalized framework over several representative methods of recommender systems and multi-task learning, including factorization machines (Rendle, 2010, 2012), deep & cross network (Wang et al., 2017a), and cross-stitch network (Misra et al., 2016). Empirically, we evaluate our method in four recommendation scenarios, i.e., movie, book, music, and news recommendations. The results demonstrate that MKR achieves substantial gains over state-of-the-art baselines in both click-through rate (CTR) prediction (e.g., improvements on average for movies) and top-$K$ recommendation (e.g., improvements on average for books). MKR can also maintain a decent performance in sparse scenarios.
Contribution
It is worth noticing that the problem studied in this paper can also be modelled as cross-domain recommendation (Tang et al., 2012) or transfer learning (Pan et al., 2010), since we care more about the performance of the recommendation task. However, the key observation is that though cross-domain recommendation and transfer learning have a single objective for the target domain, their loss functions still contain constraint terms for measuring the data distribution in the source domain or the similarity between the two domains. In our proposed MKR, the KGE task serves explicitly as the constraint term to provide regularization for recommender systems. We would like to emphasize that the major contribution of this paper is exactly modeling the problem as multi-task learning: we go a step further than cross-domain recommendation and transfer learning by finding that the inter-task similarity is helpful to not only recommender systems but also knowledge graph embedding, as shown in the theoretical analysis and experiment results.
2. Our Approach
In this section, we first formulate the knowledge graph enhanced recommendation problem, then introduce the framework of MKR and present the design of the crosscompress unit, recommendation module and KGE module in detail. We lastly discuss the learning algorithm for MKR.
2.1. Problem Formulation
We formulate the knowledge graph enhanced recommendation problem in this paper as follows. In a typical recommendation scenario, we have a set of users $\mathcal{U}$ and a set of items $\mathcal{V}$. The user-item interaction matrix $\mathbf{Y}$ is defined according to users' implicit feedback, where $y_{uv} = 1$ indicates that user $u$ engaged with item $v$, such as behaviors of clicking, watching, browsing, or purchasing; otherwise $y_{uv} = 0$. Additionally, we also have access to a knowledge graph $\mathcal{G}$, which is comprised of entity-relation-entity triples $(h, r, t)$. Here $h$, $r$, and $t$ denote the head, relation, and tail of a knowledge triple, respectively. For example, the triple (Quentin Tarantino, film.director.film, Pulp Fiction) states the fact that Quentin Tarantino directs the film Pulp Fiction. In many recommendation scenarios, an item $v \in \mathcal{V}$ may be associated with one or more entities in $\mathcal{G}$. For example, in movie recommendation, the item "Pulp Fiction" is linked with its namesake entity in a KG, while in news recommendation, news with the title "Trump pledges aid to Silicon Valley during tech meeting" is linked with entities "Donald Trump" and "Silicon Valley" in a KG.
Given the user-item interaction matrix $\mathbf{Y}$ as well as the knowledge graph $\mathcal{G}$, we aim to predict whether user $u$ has potential interest in item $v$ with which he has had no interaction before. Our goal is to learn a prediction function $\hat{y}_{uv} = \mathcal{F}(u, v \mid \Theta, \mathbf{Y}, \mathcal{G})$, where $\hat{y}_{uv}$ denotes the probability that user $u$ will engage with item $v$, and $\Theta$ denotes the model parameters of function $\mathcal{F}$.
2.2. Framework
The framework of MKR is illustrated in Figure 0(a). MKR consists of three main components: the recommendation module, the KGE module, and cross-compress units. (1) The recommendation module on the left takes a user and an item as input, and uses a multi-layer perceptron (MLP) and cross-compress units to extract short and dense features for the user and the item, respectively. The extracted features are then fed together into another MLP to output the predicted probability. (2) Similar to the left part, the KGE module in the right part also uses multiple layers to extract features from the head and relation of a knowledge triple, and outputs the representation of the predicted tail under the supervision of a score function and the real tail. (3) The recommendation module and the KGE module are bridged by specially designed cross-compress units. The proposed unit can automatically learn high-order feature interactions of items in recommender systems and entities in the KG.
2.3. Cross-compress Unit
To model feature interactions between items and entities, we design a cross-compress unit in the MKR framework. As shown in Figure 0(b), for item $v$ and one of its associated entities $e$, we first construct pairwise interactions of their latent features $\mathbf{v}_l \in \mathbb{R}^d$ and $\mathbf{e}_l \in \mathbb{R}^d$ from layer $l$:
(1) $\mathbf{C}_l = \mathbf{v}_l \mathbf{e}_l^\top \in \mathbb{R}^{d \times d}$
where $\mathbf{C}_l$ is the cross feature matrix of layer $l$, and $d$ is the dimension of hidden layers. This is called the cross operation, since each possible feature interaction $v_l^{(i)} e_l^{(j)}$ between item $v$ and its associated entity $e$ is modeled explicitly in the cross feature matrix. We then output the feature vectors of items and entities for the next layer by projecting the cross feature matrix into their latent representation spaces:
(2) $\mathbf{v}_{l+1} = \mathbf{C}_l \mathbf{w}_l^{VV} + \mathbf{C}_l^\top \mathbf{w}_l^{EV} + \mathbf{b}_l^V, \quad \mathbf{e}_{l+1} = \mathbf{C}_l \mathbf{w}_l^{VE} + \mathbf{C}_l^\top \mathbf{w}_l^{EE} + \mathbf{b}_l^E$
where $\mathbf{w}_l^{\cdot\cdot} \in \mathbb{R}^d$ and $\mathbf{b}_l^{\cdot} \in \mathbb{R}^d$ are trainable weight and bias vectors. This is called the compress operation, since the weight vectors project the cross feature matrix from the space $\mathbb{R}^{d \times d}$ back to the feature spaces $\mathbb{R}^d$. Note that in Eq. (2), the cross feature matrix is compressed along both horizontal and vertical directions (by operating on both $\mathbf{C}_l$ and $\mathbf{C}_l^\top$) for the sake of symmetry, but we will provide more insights into this design in Section 3.2. For simplicity, the cross-compress unit is denoted as:
(3) $[\mathbf{v}_{l+1}, \mathbf{e}_{l+1}] = \mathcal{C}(\mathbf{v}_l, \mathbf{e}_l)$
and we use a suffix $[\mathbf{v}]$ or $[\mathbf{e}]$ to distinguish its two outputs in the rest of this paper. Through cross-compress units, MKR can adaptively adjust the weights of knowledge transfer and learn the relevance between the two tasks.
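To make the two operations concrete, here is a minimal NumPy sketch of a single cross-compress unit. The parameter names `w_vv`, `w_ev`, `w_ve`, and `w_ee` are our own labels for the four compress weight vectors; in the actual model they are trained jointly with the rest of the network.

```python
import numpy as np

def cross_compress(v, e, w_vv, w_ev, w_ve, w_ee, b_v, b_e):
    # Cross: build the d x d matrix of all pairwise item-entity
    # feature interactions, Eq. (1).
    C = np.outer(v, e)
    # Compress: project C and its transpose back to the d-dimensional
    # item and entity feature spaces, Eq. (2).
    v_next = C @ w_vv + C.T @ w_ev + b_v
    e_next = C @ w_ve + C.T @ w_ee + b_e
    return v_next, e_next

# Toy forward pass with d = 4.
rng = np.random.default_rng(0)
d = 4
v, e = rng.normal(size=d), rng.normal(size=d)
weights = [rng.normal(size=d) for _ in range(4)]
biases = [np.zeros(d), np.zeros(d)]
v1, e1 = cross_compress(v, e, *weights, *biases)
```

Note that each compress step needs only $2d$ weights per output rather than a full $d \times d$ projection, which keeps the unit lightweight.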
It should be noted that cross-compress units should only exist in low-level layers of MKR, as shown in Figure 0(a). This is because: (1) In deep architectures, features usually transform from general to specific along the network, and feature transferability drops significantly in higher layers with increasing task dissimilarity (Yosinski et al., 2014). Sharing high-level layers therefore risks possible negative transfer, especially for the heterogeneous tasks in MKR. (2) In high-level layers of MKR, item features are mixed with user features, and entity features are mixed with relation features. The mixed features are not suitable for sharing since they have no explicit association.
2.4. Recommendation Module
The input of the recommendation module in MKR consists of two raw feature vectors $\mathbf{u}$ and $\mathbf{v}$ that describe user $u$ and item $v$, respectively. $\mathbf{u}$ and $\mathbf{v}$ can be customized as one-hot IDs (He et al., 2017), attributes (Wang et al., 2018b), bag-of-words (Wang et al., 2015), or their combinations, based on the application scenario. Given user $u$'s raw feature vector $\mathbf{u}$, we use an $L$-layer MLP to extract his latent condensed feature (we use the exponent notation in Eq. (4) and the following equations for simplicity, but note that the parameters of the $L$ layers are actually different):
(4) $\mathbf{u}_L = \mathcal{M}(\mathcal{M}(\cdots \mathcal{M}(\mathbf{u}))) = \mathcal{M}^L(\mathbf{u})$
where $\mathcal{M}(\mathbf{x}) = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$ is a fully-connected neural network layer (exploring a more elaborate design of layers in the recommendation module is an important direction of future work) with weight $\mathbf{W}$, bias $\mathbf{b}$, and nonlinear activation function $\sigma(\cdot)$. For item $v$, we use $L$ cross-compress units to extract its feature:
(5) $\mathbf{v}_L = \mathbb{E}_{e \sim \mathcal{S}(v)}\big[\mathcal{C}^L(\mathbf{v}, \mathbf{e})[\mathbf{v}]\big]$
where $\mathcal{S}(v)$ is the set of associated entities of item $v$.
After obtaining user $u$'s latent feature $\mathbf{u}_L$ and item $v$'s latent feature $\mathbf{v}_L$, we combine the two pathways by a predicting function $f_{RS}$, for example, the inner product or an $H$-layer MLP. The final predicted probability of user $u$ engaging with item $v$ is:
(6) $\hat{y}_{uv} = \sigma\big(f_{RS}(\mathbf{u}_L, \mathbf{v}_L)\big)$
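The recommendation pathway can be sketched as follows (a schematic, not the authors' implementation): a user MLP, followed by the combination of the two pathways with the inner product as the predicting function. Here `v_latent` stands in for the item feature that the cross-compress units would produce, and the single-layer MLP is an illustrative simplification.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(x, layers):
    # Stacked fully-connected layers with ReLU; each layer
    # has its own (W, b), as noted for Eq. (4).
    for W, b in layers:
        x = np.maximum(0.0, W @ x + b)
    return x

def predict_click(u_raw, v_latent, user_layers):
    # Extract the user's latent feature, then combine the two
    # pathways with an inner product and a sigmoid.
    u_L = mlp(u_raw, user_layers)
    return sigmoid(u_L @ v_latent)

rng = np.random.default_rng(1)
d = 8
u_raw = rng.normal(size=16)
user_layers = [(rng.normal(size=(d, 16)) * 0.1, np.zeros(d))]
v_latent = rng.normal(size=d)
p = predict_click(u_raw, v_latent, user_layers)
```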
2.5. Knowledge Graph Embedding Module
Knowledge graph embedding aims to embed entities and relations into continuous vector spaces while preserving their structure. Recently, researchers have proposed a great many KGE methods, including translational distance models (Bordes et al., 2013; Lin et al., 2015) and semantic matching models (Nickel et al., 2016; Liu et al., 2017). In MKR, we propose a deep semantic matching architecture for the KGE module. Similar to the recommendation module, for a given knowledge triple $(h, r, t)$, we first utilize multiple cross-compress units and nonlinear layers to process the raw feature vectors of head $h$ and relation $r$ (including ID (Lin et al., 2015), types (Xie et al., 2016), textual description (Wang et al., 2014), etc.), respectively. Their latent features are then concatenated together, followed by a $K$-layer MLP for predicting the tail $t$:
(7) $\mathbf{h}_L = \mathbb{E}_{v \sim \mathcal{S}(h)}\big[\mathcal{C}^L(\mathbf{v}, \mathbf{h})[\mathbf{e}]\big], \quad \mathbf{r}_L = \mathcal{M}^L(\mathbf{r}), \quad \hat{\mathbf{t}} = \mathcal{M}^K\big([\mathbf{h}_L;\, \mathbf{r}_L]\big)$
where $\mathcal{S}(h)$ is the set of associated items of entity $h$, and $\hat{\mathbf{t}}$ is the predicted vector of tail $t$. Finally, the score of the triple $(h, r, t)$ is calculated using a score (similarity) function $f_{KG}$:
(8) $score(h, r, t) = f_{KG}(\mathbf{t}, \hat{\mathbf{t}})$
where $\mathbf{t}$ is the real feature vector of $t$. In this paper, we use the normalized inner product $f_{KG}(\mathbf{t}, \hat{\mathbf{t}}) = \sigma(\mathbf{t}^\top \hat{\mathbf{t}})$ as the choice of score function (Misra et al., 2016), but other forms of (dis)similarity metrics can also be applied here, such as the Kullback-Leibler divergence.
2.6. Learning Algorithm
The complete loss function of MKR is as follows:
(9) $\mathcal{L} = \mathcal{L}_{RS} + \mathcal{L}_{KG} + \mathcal{L}_{REG} = \sum_{u \in \mathcal{U}, v \in \mathcal{V}} \mathcal{J}(\hat{y}_{uv}, y_{uv}) - \lambda_1 \Big( \sum_{(h,r,t) \in \mathcal{G}} score(h, r, t) - \sum_{(h',r,t') \notin \mathcal{G}} score(h', r, t') \Big) + \lambda_2 \|\mathbf{W}\|_2^2$
In Eq. (9), the first term measures the loss in the recommendation module, where $u$ and $v$ traverse the set of users and items, respectively, and $\mathcal{J}$ is the cross-entropy function. The second term calculates the loss in the KGE module, in which we aim to increase the score for all true triples while reducing the score for all false triples. The last term is the regularization term for preventing over-fitting; $\lambda_1$ and $\lambda_2$ are the balancing parameters ($\lambda_1$ can be seen as the ratio of two learning rates for the two tasks).
Note that the loss function in Eq. (9) traverses all possible user-item pairs and knowledge triples. To make computation more efficient, following (Mikolov et al., 2013), we use a negative sampling strategy during training.
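Such a negative sampling step for the recommendation term can be sketched as follows (a hypothetical helper; the real training code would analogously sample corrupted triples for the KGE term):

```python
import random

def negative_sample(interactions, all_items, n_neg=1, seed=0):
    # For every observed (user, item) pair (label 1), draw n_neg items
    # the user has not interacted with as negatives (label 0).
    rng = random.Random(seed)
    observed = set(interactions)
    samples = []
    for u, i in interactions:
        samples.append((u, i, 1))
        drawn = 0
        while drawn < n_neg:
            j = rng.choice(all_items)
            if (u, j) not in observed:
                samples.append((u, j, 0))
                drawn += 1
    return samples

pairs = [(0, 1), (0, 2), (1, 3)]
samples = negative_sample(pairs, all_items=list(range(10)))
```

This replaces the full sum over all user-item pairs in Eq. (9) with a sum over observed positives and a small number of sampled negatives per positive.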
The learning algorithm of MKR is presented in Algorithm 1, in which a training epoch consists of two stages: the recommendation task (lines 3-7) and the KGE task (lines 8-10). In each epoch, we repeat training on the recommendation task $t$ times ($t$ is a hyper-parameter, normally greater than 1) before training on the KGE task once, since we are more focused on improving recommendation performance. We will discuss the choice of $t$ in the experiments section.
3. Theoretical Analysis
In this section, we prove that cross-compress units have sufficient capability of polynomial approximation. We also show that MKR is a generalized framework over several representative methods of recommender systems and multi-task learning.
3.1. Polynomial Approximation
According to the Weierstrass approximation theorem (Rudin et al., 1964), any function under certain smoothness assumptions can be approximated by a polynomial to arbitrary accuracy. Therefore, we examine the ability of high-order interaction approximation of the cross-compress unit. We show that cross-compress units can model the order of item-entity feature interaction up to exponential degree:
Theorem 1.
Denote the inputs of item and entity in the MKR network as $\mathbf{v} = [v_1, \cdots, v_d]^\top$ and $\mathbf{e} = [e_1, \cdots, e_d]^\top$, respectively. Then the cross terms about $\mathbf{v}$ and $\mathbf{e}$ in $\|\mathbf{v}_L\|_1$ and $\|\mathbf{e}_L\|_1$ (the L1-norms of $\mathbf{v}_L$ and $\mathbf{e}_L$) with maximal degree are of the form $k_{\boldsymbol{\alpha}, \boldsymbol{\beta}}\, v_1^{\alpha_1} \cdots v_d^{\alpha_d}\, e_1^{\beta_1} \cdots e_d^{\beta_d}$, where $k_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \in \mathbb{R}$, $\alpha_i, \beta_i \in \mathbb{N}$ for $i \in \{1, \cdots, d\}$, and $\sum_{i=1}^{d} \alpha_i = \sum_{i=1}^{d} \beta_i = 2^{L-1}$ ($L \geq 1$).
In recommender systems, such a cross term is also called a combinatorial feature, as it measures the interactions of multiple original features. Theorem 1 states that cross-compress units can automatically model the combinatorial features of items and entities for sufficiently high order, which demonstrates the superior approximation capacity of MKR as compared with existing work such as Wide&Deep (Cheng et al., 2016), factorization machines (Rendle, 2010, 2012), and DCN (Wang et al., 2017a). The proof of Theorem 1 is provided in the Appendix. Note that Theorem 1 gives a theoretical view of the polynomial approximation ability of the cross-compress unit rather than providing guarantees on its actual performance. We will empirically evaluate the cross-compress unit in the experiments section.
3.2. Unified View of Representative Methods
In the following we provide a unified view of several representative models in recommender systems and multi-task learning, by showing that they are restricted versions of, or theoretically related to, MKR. This justifies the design of the cross-compress unit and conceptually explains its strong empirical performance as compared to baselines.
3.2.1. Factorization machines
Factorization machines (Rendle, 2010, 2012)
are a generic method for recommender systems. Given an input feature vector, FMs model all interactions between variables in the input vector using factorized parameters, and are thus able to estimate interactions in problems with huge sparsity such as recommender systems. The model equation for a 2-degree factorization machine is defined as
(10) $\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j$
where $x_i$ is the $i$-th unit of input vector $\mathbf{x}$, $w_i$ is a weight scalar, $\mathbf{v}_i$ is a weight vector, and $\langle \cdot, \cdot \rangle$ is the dot product of two vectors. We show that the essence of FM is conceptually similar to a 1-layer cross-compress unit:
Proposition 2.
The L1-norms of $\mathbf{v}_1$ and $\mathbf{e}_1$ can be written as the following form:
(11) $\|\mathbf{v}_1\|_1\ \big(\text{or } \|\mathbf{e}_1\|_1\big) = \Big| \sum_{i=1}^{d} \sum_{j=1}^{d} (w_i + w_j)\, v_i e_j + b \Big|$
where $w_i + w_j$ is the sum of two scalars.
It is interesting to notice that, instead of factorizing the weight parameter of $x_i x_j$ into the dot product of two vectors as in FM, the weight of the term $v_i e_j$ is factorized into the sum of two scalars in the cross-compress unit, to reduce the number of parameters and increase the robustness of the model.
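To make Eq. (10) concrete, the following sketch evaluates a 2-degree FM both by the brute-force pairwise sum and by the standard $O(kn)$ reformulation of the interaction term; the parameter names are illustrative.

```python
import numpy as np

def fm_2degree(x, w0, w, V):
    # Eq. (10): global bias + linear terms + pairwise interactions
    # whose weights are factorized as dot products <V_i, V_j>.
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y

rng = np.random.default_rng(3)
n, k = 5, 3
x = rng.normal(size=n)
w0, w, V = 0.5, rng.normal(size=n), rng.normal(size=(n, k))
y = fm_2degree(x, w0, w, V)

# Equivalent O(kn) form commonly used in practice:
# sum_{i<j} <V_i,V_j> x_i x_j = 0.5 * sum_f [ (sum_i V_if x_i)^2
#                                            - sum_i V_if^2 x_i^2 ]
y_fast = w0 + w @ x + 0.5 * float(np.sum((V.T @ x) ** 2 - (V**2).T @ (x**2)))
```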
3.2.2. Deep & Cross Network
DCN (Wang et al., 2017a) learns explicit and high-order cross features by introducing cross layers:
(12) $\mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}_l^\top \mathbf{w}_l + \mathbf{x}_l + \mathbf{b}_l$
where $\mathbf{x}_l$, $\mathbf{w}_l$, and $\mathbf{b}_l$ are the representation, weight, and bias of the $l$-th layer. We demonstrate the link between DCN and MKR by the following proposition:
Proposition 3.
In the formula of $\mathbf{v}_{l+1}$ in Eq. (2), if we restrict $\mathbf{w}_l^{VV}$ in the first term to satisfy $\mathbf{e}_l^\top \mathbf{w}_l^{VV} = 1$ and restrict $\mathbf{e}_l$ in the second term to be $\mathbf{e}_0$ (and impose similar restrictions on $\mathbf{e}_{l+1}$), the cross-compress unit is then conceptually equivalent to the DCN layer in the sense of multi-task learning:
(13) $\mathbf{v}_{l+1} = \mathbf{e}_0 \mathbf{v}_l^\top \mathbf{w}_l^{EV} + \mathbf{v}_l + \mathbf{b}_l^V, \quad \mathbf{e}_{l+1} = \mathbf{v}_0 \mathbf{e}_l^\top \mathbf{w}_l^{VE} + \mathbf{e}_l + \mathbf{b}_l^E$
It can be proven that the polynomial approximation ability of the above DCN-equivalent version (i.e., the maximal degree of cross terms in $\mathbf{v}_l$ and $\mathbf{e}_l$) grows only linearly with depth, which is weaker than the exponential approximation ability of original cross-compress units.
3.2.3. Cross-stitch Networks
Cross-stitch networks (Misra et al., 2016) are a multi-task learning model for convolutional networks, in which the designed cross-stitch unit can learn a combination of shared and task-specific representations between two tasks. Specifically, given two activation maps $x_A$ and $x_B$ from layer $l$ of the two tasks, the cross-stitch network learns linear combinations $\tilde{x}_A$ and $\tilde{x}_B$ of both input activations and feeds these combinations as input to the next layers' filters. The formula at location $(i, j)$ in the activation map is
(14) $\begin{bmatrix} \tilde{x}_A^{ij} \\ \tilde{x}_B^{ij} \end{bmatrix} = \begin{bmatrix} \alpha_{AA} & \alpha_{AB} \\ \alpha_{BA} & \alpha_{BB} \end{bmatrix} \begin{bmatrix} x_A^{ij} \\ x_B^{ij} \end{bmatrix}$
where the $\alpha$'s are trainable transfer weights of representations between task A and task B. We show that the cross-stitch unit in Eq. (14) is a simplified version of our cross-compress unit by the following proposition:
Proposition 4.
If we omit all biases in Eq. (2), the cross-compress unit can be written as
(15) $\begin{bmatrix} \mathbf{v}_{l+1} \\ \mathbf{e}_{l+1} \end{bmatrix} = \begin{bmatrix} \mathbf{e}_l^\top \mathbf{w}_l^{VV} & \mathbf{v}_l^\top \mathbf{w}_l^{EV} \\ \mathbf{e}_l^\top \mathbf{w}_l^{VE} & \mathbf{v}_l^\top \mathbf{w}_l^{EE} \end{bmatrix} \begin{bmatrix} \mathbf{v}_l \\ \mathbf{e}_l \end{bmatrix}$
The transfer matrix in Eq. (15) serves as the cross-stitch unit in Eq. (14). Like cross-stitch networks, the MKR network can decide to make certain layers task-specific by setting $\mathbf{v}_l^\top \mathbf{w}_l^{EV}$ ($\alpha_{AB}$) or $\mathbf{e}_l^\top \mathbf{w}_l^{VE}$ ($\alpha_{BA}$) to zero, or choose a more shared representation by assigning a higher value to them. But the transfer matrix is more fine-grained in the cross-compress unit, because the transfer weights are replaced from scalars with dot products of two vectors. It is rather interesting to notice that Eq. (15) can also be regarded as an attention mechanism (Bahdanau et al., 2015), as the computation of transfer weights involves the feature vectors $\mathbf{v}_l$ and $\mathbf{e}_l$ themselves.
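For comparison, the cross-stitch mixing of Eq. (14) can be sketched in a few lines; here `alpha` is the trainable 2x2 transfer matrix, whose scalar entries the cross-compress unit replaces with input-dependent dot products.

```python
import numpy as np

def cross_stitch(x_a, x_b, alpha):
    # Eq. (14): per-location linear mixing of the two tasks'
    # activations by the 2x2 transfer matrix alpha.
    y_a = alpha[0, 0] * x_a + alpha[0, 1] * x_b
    y_b = alpha[1, 0] * x_a + alpha[1, 1] * x_b
    return y_a, y_b

x_a = np.array([1.0, 2.0])
x_b = np.array([3.0, 4.0])
# Setting alpha_AB = alpha_BA = 0 keeps the layer fully task-specific.
alpha = np.array([[1.0, 0.0], [0.0, 1.0]])
y_a, y_b = cross_stitch(x_a, x_b, alpha)
```

With an identity `alpha`, each task's activations pass through unchanged; off-diagonal weights control how much representation is shared.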
4. Experiments
In this section, we evaluate the performance of MKR in four real-world recommendation scenarios: movie, book, music, and news. (The source code is available at https://github.com/hwwang55/MKR.)
Table 1: Basic statistics and hyper-parameter settings of the four datasets.

| Dataset | # users | # items | # interactions | # KG triples | Hyper-parameters |
| MovieLens-1M | 6,036 | 2,347 | 753,772 | 20,195 | |
| Book-Crossing | 17,860 | 14,910 | 139,746 | 19,793 | |
| Last.FM | 1,872 | 3,846 | 42,346 | 15,518 | = 2 |
| Bing-News | 141,487 | 535,145 | 1,025,192 | 1,545,217 | |
4.1. Datasets
We utilize the following four datasets in our experiments:

MovieLens-1M (https://grouplens.org/datasets/movielens/1m/) is a widely used benchmark dataset in movie recommendation, which consists of approximately 1 million explicit ratings (ranging from 1 to 5) from the MovieLens website.

Book-Crossing (http://www2.informatik.uni-freiburg.de/~cziegler/BX/) contains 1,149,780 explicit ratings (ranging from 0 to 10) of books in the Book-Crossing community.

Last.FM (https://grouplens.org/datasets/hetrec2011/) contains musician listening information from a set of 2 thousand users of the Last.fm online music system.

Bing-News contains 1,025,192 pieces of implicit feedback collected from the server logs of Bing News (https://www.bing.com/news) from October 16, 2016 to August 11, 2017. Each piece of news has a title and a snippet.
Since MovieLens-1M, Book-Crossing, and Last.FM are explicit feedback data (Last.FM provides the listening count as a weight for each user-item interaction), we transform them into implicit feedback, where each entry is marked with 1 indicating that the user has rated the item positively, and we sample an unwatched set marked as 0 for each user. The threshold for a positive rating is 4 for MovieLens-1M, while no threshold is set for Book-Crossing and Last.FM due to their sparsity.
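The explicit-to-implicit conversion described above can be sketched as follows (a hypothetical helper; sampling the unwatched negative set would follow the same pattern as the negative sampling in Section 2.6):

```python
def to_implicit(ratings, threshold=None):
    # Mark each rating at or above the threshold as a positive
    # (label 1); with threshold=None every rated item counts as
    # positive, as done for Book-Crossing and Last.FM.
    return [(u, i, 1) for (u, i, r) in ratings
            if threshold is None or r >= threshold]

movielens = [(0, 10, 5), (0, 11, 3), (1, 12, 4)]
positives = to_implicit(movielens, threshold=4)
```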
We use Microsoft Satori to construct the KG for each dataset. We first select a subset of triples from the whole KG with a confidence level greater than 0.9. For MovieLens-1M and Book-Crossing, we additionally select a subset of triples from the sub-KG whose relation name contains "film" or "book", respectively, to further reduce the KG size.
Given the sub-KGs, for MovieLens-1M, Book-Crossing, and Last.FM, we collect the IDs of all valid movies, books, or musicians by matching their names with the tails of triples (head, film.film.name, tail), (head, book.book.title, tail), or (head, type.object.name, tail), respectively. For simplicity, items with no matched or multiple matched entities are excluded. We then match the IDs with the heads and tails of all KG triples and select all well-matched triples from the sub-KG. The construction process is similar for Bing-News except that: (1) we use entity linking tools to extract entities in news titles; (2) we do not impose restrictions on the names of relations since the entities in news titles are not within one particular domain. The basic statistics of the four datasets are presented in Table 1. Note that the numbers of users, items, and interactions are smaller than in the original datasets since we filtered out items with no corresponding entity in the KG.
4.2. Baselines
We compare our proposed MKR with the following baselines. Unless otherwise specified, the hyper-parameter settings of the baselines are the same as reported in their original papers or as default in their codes.

PER (Yu et al., 2014) treats the KG as a heterogeneous information network and extracts meta-path based features to represent the connectivity between users and items. In this paper, we use manually designed user-item-attribute-item paths as features, i.e., "user-movie-director-movie", "user-movie-genre-movie", and "user-movie-star-movie" for MovieLens-1M; "user-book-author-book" and "user-book-genre-book" for Book-Crossing; "user-musician-genre-musician", "user-musician-country-musician", and "user-musician-age-musician" (age is discretized) for Last.FM. Note that PER cannot be applied to news recommendation because it is hard to pre-define meta-paths for entities in news.

CKE (Zhang et al., 2016) combines CF with structural, textual, and visual knowledge in a unified framework for recommendation. We implement CKE as CF plus a structural knowledge module in this paper. The dimensions of user and item embeddings for the four datasets are set as 64, 128, 32, and 64, respectively. The dimension of entity embeddings is .

DKN (Wang et al., 2018d) treats entity embeddings and word embeddings as multiple channels and combines them in a CNN for CTR prediction. In this paper, we use movie/book names and news titles as textual input for DKN. The dimension of word and entity embeddings is 64, and the number of filters is 128 for each window size 1, 2, 3.

RippleNet (Wang et al., 2018c) is a memory-network-like approach that propagates users' preferences on the knowledge graph for recommendation. The hyper-parameter settings for Last.FM are , , , , .

LibFM (Rendle, 2012) is a widely used feature-based factorization model. We concatenate the raw features of users and items as well as the corresponding averaged entity embeddings learned from TransR (Lin et al., 2015) as input for LibFM. The dimension is {1, 1, 8} and the number of training epochs is 50. The dimension of TransR is 32.

Wide&Deep (Cheng et al., 2016) is a deep recommendation model combining a (wide) linear channel with a (deep) nonlinear channel. The input for Wide&Deep is the same as in LibFM. The dimension of user, item, and entity embeddings is 64, and we use a two-layer deep channel with dimensions 100 and 50 as well as a wide channel.
Table 2: The results of AUC and Accuracy in CTR prediction. Each cell shows AUC / Accuracy; numbers in parentheses are relative differences from MKR, and "-" means not applicable.

| Model | MovieLens-1M | Book-Crossing | Last.FM | Bing-News |
| PER | 0.710 (-22.6%) / 0.664 (-21.2%) | 0.623 (-15.1%) / 0.588 (-16.7%) | 0.633 (-20.6%) / 0.596 (-20.7%) | - / - |
| CKE | 0.801 (-12.6%) / 0.742 (-12.0%) | 0.671 (-8.6%) / 0.633 (-10.3%) | 0.744 (-6.6%) / 0.673 (-10.5%) | 0.553 (-19.7%) / 0.516 (-20.0%) |
| DKN | 0.655 (-28.6%) / 0.589 (-30.1%) | 0.622 (-15.3%) / 0.598 (-15.3%) | 0.602 (-24.5%) / 0.581 (-22.7%) | 0.667 (-3.2%) / 0.610 (-5.4%) |
| RippleNet | 0.920 (+0.3%) / 0.842 (-0.1%) | 0.729 (-0.7%) / 0.662 (-6.2%) | 0.768 (-3.6%) / 0.691 (-8.1%) | 0.678 (-1.6%) / 0.630 (-2.3%) |
| LibFM | 0.892 (-2.7%) / 0.812 (-3.7%) | 0.685 (-6.7%) / 0.640 (-9.3%) | 0.777 (-2.5%) / 0.709 (-5.7%) | 0.640 (-7.1%) / 0.591 (-8.4%) |
| Wide&Deep | 0.898 (-2.1%) / 0.820 (-2.7%) | 0.712 (-3.0%) / 0.624 (-11.6%) | 0.756 (-5.1%) / 0.688 (-8.5%) | 0.651 (-5.5%) / 0.597 (-7.4%) |
| MKR | 0.917 / 0.843 | 0.734 / 0.704 | 0.797 / 0.752 | 0.689 / 0.645 |
| MKR-1L | - / - | - / - | 0.795 (-0.3%) / 0.749 (-0.4%) | 0.680 (-1.3%) / 0.631 (-2.2%) |
| MKR-DCN | 0.883 (-3.7%) / 0.802 (-4.9%) | 0.705 (-4.3%) / 0.676 (-4.2%) | 0.778 (-2.4%) / 0.730 (-2.9%) | 0.671 (-2.6%) / 0.614 (-4.8%) |
| MKR-stitch | 0.905 (-1.3%) / 0.830 (-1.5%) | 0.721 (-2.2%) / 0.682 (-3.4%) | 0.772 (-3.1%) / 0.725 (-3.6%) | 0.674 (-2.2%) / 0.621 (-3.7%) |
4.3. Experiment setup
In MKR, we use the inner product as $f_{RS}$; the number of high-level layers and the other hyper-parameters are given in Table 1. The settings of hyper-parameters are determined by optimization on a validation set. For each dataset, we split the interactions into training, validation, and test sets, with the validation and test sets held fixed. Each experiment is repeated several times, and the average performance is reported. We evaluate our method in two experiment scenarios: (1) In click-through rate (CTR) prediction, we apply the trained model to each interaction in the test set and output the predicted click probability. We use AUC and Accuracy to evaluate the performance of CTR prediction. (2) In top-$K$ recommendation, we use the trained model to select the $K$ items with the highest predicted click probability for each user in the test set, and choose Precision@$K$ and Recall@$K$ to evaluate the recommended sets.
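As an illustration of the top-$K$ evaluation protocol, a per-user precision/recall computation might look like this (a sketch assuming the standard definitions over the ranked recommendation list):

```python
def precision_recall_at_k(ranked_items, relevant, k):
    # Precision@K: fraction of the K recommended items that are relevant.
    # Recall@K: fraction of the user's relevant items recovered in the top K.
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k, hits / len(relevant)

ranked = [5, 3, 9, 1, 7]      # items sorted by predicted click probability
relevant = {3, 7, 8}          # items the user actually engaged with
p, r = precision_recall_at_k(ranked, relevant, k=3)
```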
4.4. Empirical study
We conduct an empirical study to investigate the correlation between items in RS and their corresponding entities in the KG. Specifically, we aim to reveal how the number of common neighbors of an item pair in the KG changes with their number of common raters in RS. To this end, we first randomly sample 1 million item pairs from MovieLens-1M. We then classify each pair into 5 categories based on the number of their common raters in RS, and count their average number of common neighbors in the KG for each category. The result is presented in Figure 1(a), which clearly shows that if two items have more common raters in RS, they are likely to share more common neighbors in the KG. Figure 1(b) shows the positive correlation from the opposite direction. The above findings empirically demonstrate that items share a similar structure of proximity in the KG and RS, and thus the cross knowledge transfer of items benefits both the recommendation and KGE tasks in MKR.
4.5. Results
4.5.1. Comparison with baselines
The results of all methods in CTR prediction and top-$K$ recommendation are presented in Table 2 and Figures 3 and 4, respectively. We have the following observations:

PER performs poorly on movie, book, and music recommendation because the user-defined meta-paths can hardly be optimal in reality. Moreover, PER cannot be applied to news recommendation.

CKE performs better in movie, book, and music recommendation than in news. This may be because MovieLens-1M, Book-Crossing, and Last.FM are much denser than Bing-News, which is more favorable for the collaborative filtering part of CKE.

DKN performs best in news recommendation compared with the other baselines, but performs worst in the other scenarios. This is because movie, book, and musician names are too short and ambiguous to provide useful information.

RippleNet performs best among all baselines, and even outperforms MKR on MovieLens-1M. This demonstrates that RippleNet can precisely capture user interests, especially when user-item interactions are dense. However, RippleNet is more sensitive to the density of datasets, as it performs worse than MKR on Book-Crossing, Last.FM, and Bing-News. We will further study their performance in sparse scenarios in Section 4.5.3.

In general, our MKR performs best among all methods on the four datasets. Specifically, MKR achieves consistent average gains in movie, book, music, and news recommendation, which demonstrates the efficacy of the multi-task learning framework in MKR. Note that the top-$K$ metrics are much lower for Bing-News because the number of news items is significantly larger than that of movies, books, and musicians.
Table 3: The results of AUC in CTR prediction on MovieLens-1M with different ratios of the training set.

| Model | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| PER | 0.598 | 0.607 | 0.621 | 0.638 | 0.647 | 0.662 | 0.675 | 0.688 | 0.697 | 0.710 |
| CKE | 0.674 | 0.692 | 0.705 | 0.716 | 0.739 | 0.754 | 0.768 | 0.775 | 0.797 | 0.801 |
| DKN | 0.579 | 0.582 | 0.589 | 0.601 | 0.612 | 0.620 | 0.631 | 0.638 | 0.646 | 0.655 |
| RippleNet | 0.843 | 0.851 | 0.859 | 0.862 | 0.870 | 0.878 | 0.890 | 0.901 | 0.912 | 0.920 |
| LibFM | 0.801 | 0.810 | 0.816 | 0.829 | 0.837 | 0.850 | 0.864 | 0.875 | 0.886 | 0.892 |
| Wide&Deep | 0.788 | 0.802 | 0.809 | 0.815 | 0.821 | 0.840 | 0.858 | 0.876 | 0.884 | 0.898 |
| MKR | 0.868 | 0.874 | 0.881 | 0.882 | 0.889 | 0.897 | 0.903 | 0.908 | 0.913 | 0.917 |
4.5.2. Comparison with MKR variants
We further compare MKR with three of its variants to demonstrate the efficacy of the cross-compress unit:

MKR-1L is MKR with one layer of cross-compress units, which corresponds to the FM model according to Proposition 2. Note that MKR-1L is actually MKR in the experiments for MovieLens-1M.

MKR-DCN is a variant of MKR based on Eq. (13), which corresponds to the DCN model.

MKR-stitch is another variant of MKR corresponding to the cross-stitch network, in which the transfer weights in Eq. (15) are replaced by four trainable scalars.
From Table 2 we observe that MKR outperforms MKR-1L and MKR-DCN, which shows that modeling high-order interactions between item and entity features is helpful for maintaining decent performance. MKR also achieves better scores than MKR-stitch. This validates the efficacy of fine-grained control on knowledge transfer in MKR compared with the simple cross-stitch units.
Table 4: RMSE between predicted and real tail vectors in the KGE task, with and without the RS module.

Dataset  KGE  KGE + RS
MovieLens-1M  0.319  0.302
Book-Crossing  0.596  0.558
Last.FM  0.480  0.471
Bing-News  0.488  0.459
4.5.3. Results in sparse scenarios
One major goal of using a knowledge graph in MKR is to alleviate the sparsity and cold-start problems of recommender systems. To investigate the efficacy of the KGE module in sparse scenarios, we vary the ratio of the training set of MovieLens-1M (while the validation and test sets are kept fixed), and report the AUC of CTR prediction for all methods. The results are shown in Table 3. We observe that the performance of all methods deteriorates as the training set shrinks. At the smallest training ratio, the AUC of PER, CKE, DKN, RippleNet, LibFM, and Wide&Deep all drop substantially compared with the case where the full training set is used. In contrast, the AUC of MKR decreases only slightly, which demonstrates that MKR can still maintain decent performance even when user-item interactions are sparse. We also notice that MKR performs better than RippleNet in sparse scenarios, which is in accordance with our observation in Section 4.5.1 that RippleNet is more sensitive to the density of user-item interactions.
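As a sketch of this evaluation protocol (our own hypothetical helpers, not the paper's code), subsampling the training interactions and scoring CTR predictions with rank-based AUC might look like:

```python
import numpy as np

def auc(labels, scores):
    """Rank-based AUC: probability that a random positive example is
    scored higher than a random negative one (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def subsample_training(interactions, ratio, seed=0):
    """Keep a random `ratio` fraction of training rows; test data stays fixed."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(interactions)) < ratio
    return interactions[keep]

labels = np.array([1, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.3, 0.1])
print(auc(labels, scores))  # perfect ranking -> 1.0
```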
4.5.4. Results on KGE side
Although the goal of MKR is to utilize the KG to assist recommendation, it is still interesting to investigate whether the RS task benefits the KGE task, since the principle of multi-task learning is to leverage shared information to improve the performance of all tasks (Zhang and Yang, 2017). We present the RMSE (root mean square error) between predicted and real tail vectors in the KGE task in Table 4. Fortunately, we find that the presence of the RS module indeed reduces the prediction error on every dataset. The results show that the cross&compress units are able to learn general, shared features that mutually benefit both sides of MKR.
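The RMSE metric used in Table 4 is standard; a tiny numpy helper (our own sketch) is:

```python
import numpy as np

def rmse(pred, real):
    """Root mean square error between predicted and real tail vectors."""
    return float(np.sqrt(np.mean((pred - real) ** 2)))

print(rmse(np.array([1.0, 2.0]), np.array([1.0, 4.0])))  # sqrt(2) ~ 1.414
```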
4.6. Parameter Sensitivity
4.6.1. Impact of KG size
We vary the size of the KG to further investigate the efficacy of KG usage. The results of AUC on Bing-News are plotted in Figure 4(a). Performance is steadily enhanced as the KG ratio increases in all three scenarios. This is because the Bing-News dataset is extremely sparse, making the effect of KG usage rather obvious.
4.6.2. Impact of RS training frequency
We investigate the influence of the RS training frequency in MKR by varying it from 1 to 10, while keeping other parameters fixed. The results are presented in Figure 4(b). We observe that MKR achieves the best performance at an intermediate value. This is because too high a training frequency of the KGE module will mislead the objective function of MKR, while too low a training frequency of KGE cannot make full use of the knowledge transferred from the KG.
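The alternating-training idea can be sketched as a simple schedule (a hypothetical helper; the paper's actual training loop and frequency parameter may differ):

```python
def training_schedule(num_epochs, rs_steps_per_kge_step):
    """Build a flat schedule: several RS training steps for every KGE step,
    so the recommendation task is trained more frequently than KGE."""
    plan = []
    for _ in range(num_epochs):
        plan.extend(["rs"] * rs_steps_per_kge_step)
        plan.append("kge")
    return plan

plan = training_schedule(num_epochs=2, rs_steps_per_kge_step=3)
print(plan)  # ['rs', 'rs', 'rs', 'kge', 'rs', 'rs', 'rs', 'kge']
```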
4.6.3. Impact of embedding dimension
We also show how the dimension of user, item, and entity embeddings affects the performance of MKR in Figure 4(c). We find that the performance initially improves as the dimension increases, because more bits in the embedding layer can encode more useful information. However, performance drops when the dimension increases further, as an overly large dimension may introduce noise that misleads the subsequent prediction.
5. Related Work
5.1. Knowledge Graph Embedding
The KGE module in MKR connects to a large body of work on KGE methods. KGE embeds entities and relations of a knowledge graph into low-dimensional vector spaces while preserving the structural information (Wang et al., 2017b). KGE methods can be classified into two categories: (1) translational distance models, which exploit distance-based scoring functions to learn representations of entities and relations, such as TransE (Bordes et al., 2013), TransH (Wang et al., 2014), and TransR (Lin et al., 2015); (2) semantic matching models, which measure the plausibility of knowledge triples by matching the latent semantics of entities and relations, such as RESCAL (Nickel et al., 2011), ANALOGY (Liu et al., 2017), and HolE (Nickel et al., 2016). Recently, researchers have also proposed incorporating auxiliary information, such as entity types (Xie et al., 2016), logic rules (Rocktäschel et al., 2015), and textual descriptions (Zhong et al., 2015), to assist KGE. The above KGE methods can also serve as the implementation of the KGE module in MKR, but note that the cross&compress unit then needs to be redesigned accordingly. Exploring other designs of the KGE module as well as the corresponding bridging unit is an important direction for future work.
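As an illustration of the translational-distance family, TransE scores a triple (h, r, t) by how close h + r lands to t; a minimal sketch (our own, using the L1 norm as one common choice):

```python
import numpy as np

def transe_score(h, r, t, ord=1):
    """TransE distance ||h + r - t||: smaller means a more plausible triple."""
    return float(np.linalg.norm(h + r - t, ord=ord))

h, r, t = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(transe_score(h, r, t))  # 0.0 for a perfectly consistent triple
```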
5.2. MultiTask Learning
Multi-task learning is a learning paradigm in machine learning whose aim is to leverage useful information contained in multiple related tasks to improve the generalization performance of all the tasks (Zhang and Yang, 2017). The learning tasks are assumed to be related to each other, and it has been found that learning them jointly can lead to performance improvements compared with learning them individually. In general, MTL algorithms can be classified into several categories, including the feature learning approach (Zhang et al., 2015; Wang et al., 2017a), low-rank approach (Han and Zhang, 2016; McDonald et al., 2014), task clustering approach (Zhou and Zhao, 2016), task relation learning approach (Lee et al., 2016), and decomposition approach (Han and Zhang, 2015). For example, the cross-stitch network (Misra et al., 2016) determines the inputs of hidden layers in different tasks by a knowledge transfer matrix; Zhou et al. (Zhou and Zhao, 2016) cluster tasks by identifying representative tasks that form a subset of the given tasks: if one task selects another as its representative, the model parameters of the two tasks are expected to be similar. MTL can also be combined with other learning paradigms to further improve performance, including semi-supervised learning, active learning, unsupervised learning, and reinforcement learning.
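For intuition, a cross-stitch unit mixes the hidden activations of two tasks through a small trainable transfer matrix; a minimal numpy sketch (variable names are our own):

```python
import numpy as np

def cross_stitch(x_a, x_b, alpha):
    """Mix two tasks' hidden activations with a 2x2 trainable matrix alpha:
    each task's next-layer input is a learned linear blend of both tasks."""
    out_a = alpha[0, 0] * x_a + alpha[0, 1] * x_b
    out_b = alpha[1, 0] * x_a + alpha[1, 1] * x_b
    return out_a, out_b

x_a, x_b = np.array([1.0, 2.0]), np.array([3.0, 4.0])
out_a, out_b = cross_stitch(x_a, x_b, np.eye(2))  # identity alpha: no sharing
print(out_a, out_b)
```

With an identity alpha the tasks stay independent; off-diagonal entries learned during training control how much each task borrows from the other.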
5.3. Deep Recommender Systems
Recently, deep learning has been revolutionizing recommender systems and achieves better performance in many recommendation scenarios. Roughly speaking, deep recommender systems can be classified into two categories: (1) using deep neural networks to process the raw features of users or items (Wang et al., 2015, 2018b; Zhang et al., 2016; Wang et al., 2017c; Guo et al., 2017). For example, Collaborative Deep Learning (Wang et al., 2015) designs autoencoders to extract short and dense features from textual input and feeds the features into a collaborative filtering module; DeepFM (Guo et al., 2017) combines factorization machines for recommendation and deep learning for feature learning in a neural network architecture. (2) Using deep neural networks to model the interaction between users and items (Huang et al., 2013; Cheng et al., 2016; Covington et al., 2016; He et al., 2017). For example, Neural Collaborative Filtering (He et al., 2017) replaces the inner product with a neural architecture to model the user-item interaction. The major difference between these methods and ours is that MKR deploys a multi-task learning framework that utilizes the knowledge from a KG to assist recommendation.

6. Conclusions and Future Work
This paper proposes MKR, a multi-task learning approach for knowledge graph enhanced recommendation. MKR is a deep, end-to-end framework consisting of two parts: the recommendation module and the KGE module. Both modules adopt multiple nonlinear layers to extract latent features from inputs and fit the complicated interactions of user-item and head-relation pairs. Since the two tasks are not independent but connected by items and entities, we design a cross&compress unit in MKR to associate the two tasks, which can automatically learn high-order interactions of item and entity features and transfer knowledge between the two tasks. We conduct extensive experiments in four recommendation scenarios. The results demonstrate the significant superiority of MKR over strong baselines and the efficacy of using the KG.
For future work, we plan to investigate other types of neural networks (such as CNNs) in the MKR framework. We will also incorporate other KGE methods as the implementation of the KGE module in MKR by redesigning the cross&compress unit.
Appendix
A Proof of Theorem 1
Proof.
We prove the theorem by induction:
Base case: When ,
Therefore, we have
It is clear that the cross term about and with the maximal degree is , so we have , and for . The proof for is similar.
Induction step: Suppose and hold for the maximal-degree terms and in and . Since and , without loss of generality, we assume that and exist in and , respectively. Then for , we have
Obviously, the maximal-degree term in is the cross term in . Since and hold for both and , the degree of the cross term therefore satisfies and . The proof for is similar. ∎
B Proof of Proposition 2
Proof.
In the proof of Theorem 1 in Appendix A, we have shown that
It is easy to see that , , and . The proof is similar for . ∎
References
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations.
 Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto GarciaDuran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multirelational data. In Advances in Neural Information Processing Systems. 2787–2795.
 Cheng et al. (2016) HengTze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
 Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191–198.

 Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence.
 Han and Zhang (2015) Lei Han and Yu Zhang. 2015. Learning tree structure in multitask learning. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 397–406.
 Han and Zhang (2016) Lei Han and Yu Zhang. 2016. MultiStage MultiTask Learning with Reduced Rank.. In AAAI. 1638–1644.
 He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and TatSeng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
 Huang et al. (2013) PoSen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In CIKM. ACM, 2333–2338.
 Jamali and Ester (2010) Mohsen Jamali and Martin Ester. 2010. A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the 4th ACM conference on Recommender systems. ACM, 135–142.
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).
 Lee et al. (2016) Giwoong Lee, Eunho Yang, and Sung Hwang. 2016. Asymmetric multitask learning based on task relatedness and loss. In International Conference on Machine Learning. 230–238.
 Lin et al. (2015) Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning Entity and Relation Embeddings for Knowledge Graph Completion.. In The 29th AAAI Conference on Artificial Intelligence. 2181–2187.
 Liu et al. (2017) Hanxiao Liu, Yuexin Wu, and Yiming Yang. 2017. Analogical Inference for MultiRelational Embeddings. In Proceedings of the 34th International Conference on Machine Learning. 2168–2178.
 Long et al. (2017) Mingsheng Long, Zhangjie Cao, Jianmin Wang, and S Yu Philip. 2017. Learning Multiple Tasks with Multilinear Relationship Networks. In Advances in Neural Information Processing Systems. 1593–1602.
 McDonald et al. (2014) Andrew M McDonald, Massimiliano Pontil, and Dimitris Stamos. 2014. Spectral ksupport norm regularization. In Advances in Neural Information Processing Systems. 3644–3652.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.

 Misra et al. (2016) Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3994–4003.
 Nickel et al. (2016) Maximilian Nickel, Lorenzo Rosasco, Tomaso A Poggio, et al. 2016. Holographic Embeddings of Knowledge Graphs. In The 30th AAAI Conference on Artificial Intelligence. 1955–1961.
 Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and HansPeter Kriegel. 2011. A ThreeWay Model for Collective Learning on MultiRelational Data. In Proceedings of the 28th International Conference on Machine Learning. 809–816.
 Pan et al. (2010) Sinno Jialin Pan, Qiang Yang, et al. 2010. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2010), 1345–1359.
 Rendle (2010) Steffen Rendle. 2010. Factorization machines. In Proceedings of the 10th IEEE International Conference on Data Mining. IEEE, 995–1000.
 Rendle (2012) Steffen Rendle. 2012. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology (TIST) 3, 3 (2012), 57.
 Rocktäschel et al. (2015) Tim Rocktäschel, Sameer Singh, and Sebastian Riedel. 2015. Injecting logical background knowledge into embeddings for relation extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1119–1129.
 Rudin et al. (1964) Walter Rudin et al. 1964. Principles of mathematical analysis. Vol. 3. McGrawhill New York.
 Tang et al. (2012) Jie Tang, Sen Wu, Jimeng Sun, and Hang Su. 2012. Crossdomain collaboration recommendation. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1285–1293.
 Wang et al. (2018a) Hongwei Wang, Jia Wang, Jialin Wang, Miao Zhao, Weinan Zhang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018a. Graphgan: Graph representation learning with generative adversarial nets. In AAAI. 2508–2515.
 Wang et al. (2017c) Hongwei Wang, Jia Wang, Miao Zhao, Jiannong Cao, and Minyi Guo. 2017c. Joint TopicSemanticaware Social Recommendation for Online Voting. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 347–356.
 Wang et al. (2015) Hao Wang, Naiyan Wang, and DitYan Yeung. 2015. Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1235–1244.
 Wang et al. (2018b) Hongwei Wang, Fuzheng Zhang, Min Hou, Xing Xie, Minyi Guo, and Qi Liu. 2018b. Shine: Signed heterogeneous information network embedding for sentiment link prediction. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 592–600.
 Wang et al. (2018c) Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2018c. RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM.
 Wang et al. (2018d) Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018d. DKN: Deep KnowledgeAware Network for News Recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1835–1844.
 Wang et al. (2017b) Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017b. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29, 12 (2017), 2724–2743.
 Wang et al. (2017a) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017a. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD’17. ACM, 12.

 Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1591–1601.
 Xie et al. (2016) Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2016. Representation Learning of Knowledge Graphs with Hierarchical Types. In IJCAI. 2965–2971.
 Xue et al. (2007) Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. 2007. Multitask learning for classification with dirichlet process priors. Journal of Machine Learning Research 8, Jan (2007), 35–63.
 Yosinski et al. (2014) Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks?. In Advances in Neural Information Processing Systems. 3320–3328.
 Yu et al. (2014) Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. 2014. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. 283–292.
 Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and WeiYing Ma. 2016. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 353–362.
 Zhang et al. (2015) Wenlu Zhang, Rongjian Li, Tao Zeng, Qian Sun, Sudhir Kumar, Jieping Ye, and Shuiwang Ji. 2015. Deep model based transfer and multitask learning for biological image analysis. In 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015. Association for Computing Machinery.
 Zhang and Yang (2017) Yu Zhang and Qiang Yang. 2017. A survey on multitask learning. arXiv preprint arXiv:1707.08114 (2017).
 Zhang and Yeung (2012) Yu Zhang and DitYan Yeung. 2012. A convex formulation for learning task relationships in multitask learning. arXiv preprint arXiv:1203.3536 (2012).
 Zhang and Yeung (2014) Yu Zhang and DitYan Yeung. 2014. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD) 8, 3 (2014), 12.
 Zhao et al. (2017) Huan Zhao, Quanming Yao, Jianda Li, Yangqiu Song, and Dik Lun Lee. 2017. Metagraph based recommendation fusion over heterogeneous information networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 635–644.
 Zhong et al. (2015) Huaping Zhong, Jianwen Zhang, Zhen Wang, Hai Wan, and Zheng Chen. 2015. Aligning knowledge and text embeddings by entity descriptions. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 267–272.
 Zhou and Zhao (2016) Qiang Zhou and Qi Zhao. 2016. Flexible Clustered MultiTask Learning by Learning Representative Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 38, 2 (2016), 266–278.