1. Introduction
Recommender systems are widely used in Internet applications to meet users' personalized interests and alleviate information overload (Covington et al., 2016; Wang et al., 2018a; Ying et al., 2018). Traditional recommender systems based on collaborative filtering (Koren et al., 2009; Wang et al., 2017b) usually suffer from the cold-start problem and have trouble recommending brand-new items that have not yet been heavily explored by users. The sparsity issue can be addressed by introducing additional sources of information such as user/item profiles (Wang et al., 2018b) or social networks (Wang et al., 2017b).
Knowledge graphs (KGs) capture structured information and relations between a set of entities (Zhang et al., 2016; Wang et al., 2018d; Huang et al., 2018; Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018; Wang et al., 2018c; Sun et al., 2018; Wang et al., 2019b, c; Wang et al., 2019a). KGs are heterogeneous graphs in which nodes correspond to entities (e.g., items or products, as well as their properties and characteristics) and edges correspond to relations. KGs provide connectivity information between items via different types of relations and thus capture semantic relatedness between the items.
The core challenge in utilizing KGs in recommender systems is to learn how to capture user-specific item-item relatedness captured by the KG. Existing KG-aware recommender systems can be classified into path-based methods (Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018), embedding-based methods (Zhang et al., 2016; Wang et al., 2018d; Huang et al., 2018; Wang et al., 2019b), and hybrid methods (Wang et al., 2018c; Sun et al., 2018; Wang et al., 2019c). However, these approaches rely on manual feature engineering, are unable to perform end-to-end training, and have poor scalability. Graph Neural Networks (GNNs), which aggregate node feature information from a node's local network neighborhood using neural networks, represent a promising advancement in graph-based representation learning (Bruna et al., 2014; Defferrard et al., 2016; Kipf and Welling, 2017; Duvenaud et al., 2015; Niepert et al., 2016; Hamilton et al., 2017). Recently, several works developed GNN architectures for recommender systems (Ying et al., 2018; Monti et al., 2017; van den Berg et al., 2017; Wu et al., 2018; Wang et al., 2019c), but these approaches are mostly designed for homogeneous bipartite user-item interaction graphs or user-/item-similarity graphs. It remains an open question how to extend GNN architectures to heterogeneous knowledge graphs.

In this paper, we develop Knowledge-aware Graph Neural Networks with Label Smoothness regularization (KGNN-LS), which extends the GNN architecture to knowledge graphs to simultaneously capture semantic relationships between items as well as personalized user preferences and interests. To account for the relational heterogeneity in KGs, similar to (Wang et al., 2019c), we use a trainable and personalized relation scoring function that transforms the KG into a user-specific weighted graph, which characterizes both the semantic information of the KG and the user's personalized interests. For example, in the movie recommendation setting the relation scoring function could learn that a given user really cares about the "director" relation between movies and persons, while somebody else may care more about the "lead actor" relation.
Using this personalized weighted graph, we then apply a graph neural network that computes the embedding of every item node by aggregating node feature information over the local network neighborhood of the item node. This way the embedding of each item captures its local KG structure in a user-personalized way.
A significant difference between our approach and traditional GNNs is that the edge weights in the graph are not given as input. We set them using a user-specific relation scoring function that is trained in a supervised fashion. However, the added flexibility of edge weights makes the learning process prone to overfitting, since the only source of supervised signal for the relation scoring function comes from user-item interactions (which are sparse in general). To remedy this problem, we develop a technique for regularizing edge weights during the learning process, which leads to better generalization. We develop an approach based on label smoothness (Zhu et al., 2003; Zhang and Lee, 2007), which assumes that adjacent entities in the KG are likely to have similar user relevancy labels/scores. In our context this assumption means that users tend to have similar preferences for items that are nearby in the KG. We prove that label smoothness regularization is equivalent to label propagation, and we design a leave-one-out loss function for label propagation to provide an extra supervised signal for learning the edge scoring function. We show that knowledge-aware graph neural networks and label smoothness regularization can be unified under the same framework, where label smoothness can be seen as a natural choice of regularization on knowledge-aware graph neural networks.
We apply the proposed method to four real-world datasets of movie, book, music, and restaurant recommendations, in which the first three datasets are public and the last is from Meituan-Dianping Group. Experiments show that our method achieves significant gains over state-of-the-art methods in recommendation accuracy. We also show that our method maintains strong recommendation performance in cold-start scenarios where user-item interactions are sparse.
2. Related Work
2.1. Graph Neural Networks
Graph Neural Networks (or Graph Convolutional Neural Networks, GCNs) aim to generalize convolutional neural networks to non-Euclidean domains (such as graphs) for robust feature learning. Bruna et al. (Bruna et al., 2014) define the convolution in the Fourier domain and calculate the eigendecomposition of the graph Laplacian, Defferrard et al. (Defferrard et al., 2016) approximate the convolutional filters by a Chebyshev expansion of the graph Laplacian, and Kipf and Welling (Kipf and Welling, 2017) propose a convolutional architecture via a first-order approximation. In contrast to these spectral GCNs, non-spectral GCNs operate on the graph directly and apply "convolution" (i.e., weighted average) to local neighbors of a node (Duvenaud et al., 2015; Niepert et al., 2016; Hamilton et al., 2017).

Recently, researchers have also deployed GNNs in recommender systems: PinSage (Ying et al., 2018) applies GNNs to the pin-board bipartite graph in Pinterest. Monti et al. (Monti et al., 2017) and Berg et al. (van den Berg et al., 2017) model recommender systems as matrix completion and design GNNs for representation learning on user-item bipartite graphs. Wu et al. (Wu et al., 2018) use GNNs on user/item structure graphs to learn user/item representations. The difference between these works and ours is that they are all designed for homogeneous bipartite graphs or user-/item-similarity graphs where GNNs can be used directly, while here we investigate GNNs for heterogeneous KGs. Wang et al. (Wang et al., 2019c) use GCNs in KGs for recommendation, but simply applying GCNs to KGs without proper regularization is prone to overfitting and leads to performance degradation, as we will show later. Schlichtkrull et al. also propose using GNNs to model KGs (Schlichtkrull et al., 2018), but not for the purpose of recommendation.
2.2. Semi-supervised Learning on Graphs
The goal of graph-based semi-supervised learning is to correctly label all nodes in a graph given that only a few nodes are labeled. Prior work often makes assumptions about the distribution of labels over the graph, and one common assumption is smooth variation of node labels across the graph. Based on different settings of edge weights in the input graph, these methods are classified as: (1) methods where edge weights are assumed to be given as input and therefore fixed (Zhu et al., 2003; Zhou et al., 2004; Baluja et al., 2008); and (2) methods where edge weights are parameterized and therefore learnable (Zhang and Lee, 2007; Wang and Zhang, 2008; Karasuyama and Mamitsuka, 2013). Inspired by these methods, we design a label smoothness regularization module in our proposed model. The major distinction of our work is that the label smoothness constraint is not used for semi-supervised learning on graphs, but serves as regularization to assist the learning of edge weights and achieves better generalization for recommender systems.

2.3. Recommendations with Knowledge Graphs
In general, existing KG-aware recommender systems can be classified into three categories: (1) Embedding-based methods (Zhang et al., 2016; Wang et al., 2018d; Huang et al., 2018; Wang et al., 2019b) pre-process a KG with knowledge graph embedding (KGE) (Wang et al., 2017a) algorithms, then incorporate the learned entity embeddings into recommendation. Embedding-based methods are highly flexible in utilizing KGs to assist recommender systems, but the KGE algorithms focus more on modeling rigorous semantic relatedness (e.g., TransE (Bordes et al., 2013) assumes $h + r \approx t$), which is more suitable for graph applications such as link prediction than for recommendation. In addition, embedding-based methods usually lack an end-to-end way of training. (2) Path-based methods (Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018) explore various patterns of connections among items in a KG (a.k.a. meta-paths or meta-graphs) to provide additional guidance for recommendations. Path-based methods make use of KGs in a more intuitive way, but they rely heavily on manually designed meta-paths/meta-graphs, which are hard to tune in practice. (3) Hybrid methods (Wang et al., 2018c; Sun et al., 2018; Wang et al., 2019c) combine the above two categories and learn user/item embeddings by exploiting the structure of KGs. Our proposed model can be seen as an instance of hybrid methods.
3. Problem Formulation
We begin by describing the KG-aware recommendation problem and introducing notation. In a typical recommendation scenario, we have a set of users $\mathcal{U} = \{u_1, u_2, \dots\}$ and a set of items $\mathcal{V} = \{v_1, v_2, \dots\}$. The user-item interaction matrix $\mathbf{Y}$ is defined according to users' implicit feedback, where $y_{uv} = 1$ indicates that user $u$ has engaged with item $v$, such as clicking, watching, or purchasing. We also have a knowledge graph $\mathcal{G} = \{(h, r, t)\}$ available, in which $h \in \mathcal{E}$, $r \in \mathcal{R}$, and $t \in \mathcal{E}$ denote the head, relation, and tail of a knowledge triple, and $\mathcal{E}$ and $\mathcal{R}$ are the set of entities and relations in the knowledge graph, respectively. For example, the triple (The Silence of the Lambs, film.film.star, Anthony Hopkins) states the fact that Anthony Hopkins is the leading actor in the film "The Silence of the Lambs". In many recommendation scenarios, an item corresponds to an entity (e.g., the item "The Silence of the Lambs" in MovieLens also appears in the knowledge graph as an entity). The set of entities $\mathcal{E}$ is composed of items ($\mathcal{V} \subseteq \mathcal{E}$) as well as non-item entities $\mathcal{E} \backslash \mathcal{V}$ (e.g., nodes corresponding to item/product properties). Given the user-item interaction matrix $\mathbf{Y}$ and the knowledge graph $\mathcal{G}$, our task is to predict whether user $u$ has potential interest in item $v$ with which he/she has not engaged before. Specifically, we aim to learn a prediction function $\hat{y}_{uv} = \mathcal{F}(u, v; \Theta, \mathbf{Y}, \mathcal{G})$, where $\hat{y}_{uv}$ denotes the probability that user $u$ will engage with item $v$, and $\Theta$ are the model parameters of function $\mathcal{F}$.

We list the key symbols used in this paper in Table 1.
Table 1. List of key symbols.

Symbol  Meaning

$\mathcal{U} = \{u_1, u_2, \dots\}$  Set of users
$\mathcal{V} = \{v_1, v_2, \dots\}$  Set of items
$\mathbf{Y}$  User-item interaction matrix
$\mathcal{G} = \{(h, r, t)\}$  Knowledge graph
$\mathcal{E}$  Set of entities
$\mathcal{R}$  Set of relations
$\mathcal{E} \backslash \mathcal{V}$  Set of non-item entities
$s_u(r)$  User-specific relation scoring function
$\mathbf{A}_u$  Adjacency matrix of $\mathcal{G}$ w.r.t. user $u$
$\mathbf{D}_u$  Diagonal degree matrix of $\mathbf{A}_u$
$\mathbf{E}$  Raw entity feature matrix
$\mathbf{H}_l$  Entity representation in the $l$-th layer
$\mathbf{W}_l$  Transformation matrix in the $l$-th layer
$l_u(e)$  Item relevancy labeling function
$l_u^*(e)$  Minimum-energy labeling function
$\hat{l}_u(v)$  Predicted relevancy label for item $v$
$R(\mathbf{A}_u)$  Label smoothness regularization on $\mathbf{A}_u$
4. Our Approach
In this section, we first introduce knowledge-aware graph neural networks and label smoothness regularization, and then present the unified model.
4.1. Preliminaries: Knowledge-aware Graph Neural Networks
The first step of our approach is to transform a heterogeneous KG into a user-personalized weighted graph that characterizes the user's preferences. To this end, similar to (Wang et al., 2019c), we use a user-specific relation scoring function $s_u(r)$ that provides the importance of relation $r$ for user $u$: $s_u(r) = g(\mathbf{u}, \mathbf{r})$, where $\mathbf{u}$ and $\mathbf{r}$ are feature vectors of user $u$ and relation type $r$, respectively, and $g$ is a differentiable function such as inner product. Intuitively, $s_u(r)$ characterizes the importance of relation $r$ to user $u$. For example, one user may be more interested in movies that have the same director as the movies he/she watched before, while another user may care more about the leading actor of movies.

Given the user-specific relation scoring function $s_u(r)$ of user $u$, the knowledge graph $\mathcal{G}$ can be transformed into a user-specific adjacency matrix $\mathbf{A}_u$, in which the entry $A_u^{ij} = s_u(r_{e_i, e_j})$, where $r_{e_i, e_j}$ is the relation between entities $e_i$ and $e_j$ in $\mathcal{G}$,^1 and $A_u^{ij} = 0$ if there is no relation between $e_i$ and $e_j$. See the left two subfigures in Figure 1 for an illustration. We also denote the raw feature matrix of entities as $\mathbf{E} \in \mathbb{R}^{|\mathcal{E}| \times d_0}$, where $d_0$ is the dimension of raw entity features. Then we use multiple feed-forward layers to update the entity representation matrix by aggregating representations of neighboring entities. Specifically, the layer-wise forward propagation can be expressed as

^1 In this work we treat $\mathcal{G}$ as an undirected graph, so $\mathbf{A}_u$ is a symmetric matrix. If both triples $(h, r_1, t)$ and $(t, r_2, h)$ exist, we only consider one of $r_1$ and $r_2$. This is due to the fact that: (1) $r_1$ and $r_2$ are the inverse of each other and semantically related; and (2) treating $\mathbf{A}_u$ as symmetric while keeping both relations would greatly increase the matrix density.
$$\mathbf{H}_{l+1} = \sigma\left(\mathbf{D}_u^{-1/2} \mathbf{A}_u \mathbf{D}_u^{-1/2} \mathbf{H}_l \mathbf{W}_l\right), \quad l = 0, 1, \dots, L-1. \qquad (1)$$

In Eq. (1), $\mathbf{H}_l$ is the matrix of hidden representations of entities in layer $l$, and $\mathbf{H}_0 = \mathbf{E}$. $\mathbf{A}_u \mathbf{H}_l$ aggregates representation vectors of neighboring entities. In this paper, we set $\mathbf{A}_u \leftarrow \mathbf{A}_u + \mathbf{I}$, i.e., adding self-connections to each entity, to ensure that the old representation vector of the entity itself is taken into consideration when updating entity representations. $\mathbf{D}_u$ is a diagonal degree matrix with entries $D_u^{ii} = \sum_j A_u^{ij}$; therefore, $\mathbf{D}_u^{-1/2}$ is used to normalize $\mathbf{A}_u$ and keep the entity representation matrix $\mathbf{H}_l$ stable. $\mathbf{W}_l$ is the layer-specific trainable weight matrix, $\sigma$ is a non-linear activation function, and $L$ is the number of layers.

A single GNN layer computes the representation of an entity via a transformed mixture of itself and its immediate neighbors in the KG. We can therefore naturally extend the model to multiple layers to explore users' potential interests in a broader and deeper way. The final output is $\mathbf{H}_L$, the entity representations that mix the initial features of the entities themselves and their neighbors up to $L$ hops away. Finally, the predicted engagement probability of user $u$ with item $v$ is calculated by $\hat{y}_{uv} = f(\mathbf{u}, \mathbf{v}_u)$, where $\mathbf{v}_u$ (i.e., the $v$-th row of $\mathbf{H}_L$) is the final representation vector of item $v$, and $f$ is a differentiable prediction function, for example, inner product or a multi-layer perceptron. Note that $\hat{y}_{uv}$
is user-specific since the adjacency matrix $\mathbf{A}_u$ is user-specific. Furthermore, note that the system is end-to-end trainable, where the gradients flow from $\hat{y}_{uv}$ via the GNN (parameter matrices $\mathbf{W}_l$) to $\mathbf{A}_u$ and eventually to the representations of users and items.

4.2. Label Smoothness Regularization
It is worth noting a significant difference between our model and traditional GNNs: in traditional GNNs, the edge weights of the input graph are fixed, but in our model, the edge weights $\mathbf{A}_u$ in Eq. (1) are learnable (including possible parameters of the function $g$ and feature vectors of users and relations) and also require supervised training, like $\mathbf{W}_l$. Though this enhances the fitting ability of the model, it inevitably makes the optimization process prone to overfitting, since the only source of supervised signal is from user-item interactions outside the GNN layers. Moreover, edge weights play an essential role in representation learning on graphs, as highlighted by a large body of prior work (Zhu et al., 2003; Zhang and Lee, 2007; Wang and Zhang, 2008; Karasuyama and Mamitsuka, 2013; Velickovic et al., 2018). Therefore, more regularization on edge weights is needed to assist the learning of entity representations and to help generalize to unobserved interactions more efficiently.
Let's see what an ideal set of edge weights should look like. Consider a real-valued label function $l_u$ on $\mathcal{G}$, which is constrained to take the value $l_u(v) = y_{uv}$ at each item node $v$. In our context, $l_u(v) = 1$ if user $u$ finds the item relevant and has engaged with it, otherwise $l_u(v) = 0$. Intuitively, we hope that adjacent entities in the KG are likely to have similar relevancy labels, which is known as the label smoothness assumption. This motivates our choice of energy function $E$:

$$E(l_u, \mathbf{A}_u) = \frac{1}{2} \sum_{e_i \in \mathcal{E},\, e_j \in \mathcal{E}} A_u^{ij} \left(l_u(e_i) - l_u(e_j)\right)^2. \qquad (2)$$
We show that the minimum-energy label function is harmonic by the following theorem:

Theorem 1. The minimum-energy label function
$$l_u^* = \arg\min_{l_u:\, l_u(v) = y_{uv},\, \forall v \in \mathcal{V}} E(l_u, \mathbf{A}_u) \qquad (3)$$
w.r.t. Eq. (2) is harmonic, i.e., it satisfies
$$l_u^*(e_i) = \frac{1}{D_u^{ii}} \sum_{e_j \in \mathcal{E}} A_u^{ij}\, l_u^*(e_j), \quad \forall e_i \in \mathcal{E} \backslash \mathcal{V}. \qquad (4)$$

Proof.
Taking the derivative of
$$E(l_u, \mathbf{A}_u) = \frac{1}{2} \sum_{e_i \in \mathcal{E},\, e_j \in \mathcal{E}} A_u^{ij} \left(l_u(e_i) - l_u(e_j)\right)^2$$
with respect to $l_u(e_i)$ where $e_i \in \mathcal{E} \backslash \mathcal{V}$, and using the symmetry of $\mathbf{A}_u$, we have
$$\frac{\partial E}{\partial l_u(e_i)} = 2 \sum_{e_j \in \mathcal{E}} A_u^{ij} \left(l_u(e_i) - l_u(e_j)\right).$$
The minimum-energy label function $l_u^*$ should satisfy
$$\frac{\partial E}{\partial l_u(e_i)} \bigg|_{l_u = l_u^*} = 0.$$
Therefore, we have
$$l_u^*(e_i) = \frac{1}{D_u^{ii}} \sum_{e_j \in \mathcal{E}} A_u^{ij}\, l_u^*(e_j), \quad \forall e_i \in \mathcal{E} \backslash \mathcal{V}.$$
∎
The harmonic property indicates that the value of $l_u^*$ at each non-item entity $e_i \in \mathcal{E} \backslash \mathcal{V}$ is the average over its neighboring entities, which leads to the following label propagation scheme (Zhu et al., 2005):
Theorem 2. Repeating the following two steps:

(1) Propagate labels for all entities: $l_u(\mathcal{E}) \leftarrow \mathbf{D}_u^{-1} \mathbf{A}_u\, l_u(\mathcal{E})$, where $l_u(\mathcal{E})$ is the vector of labels for all entities;

(2) Reset labels of all items to initial labels: $l_u(\mathcal{V}) \leftarrow \mathbf{Y}[u, \mathcal{V}]^\top$, where $l_u(\mathcal{V})$ is the vector of labels for all items and $\mathbf{Y}[u, \mathcal{V}] = [y_{uv_1}, y_{uv_2}, \dots]$ are the initial labels;

will lead to $l_u \to l_u^*$.
Proof.
Let $l_u(\mathcal{E}) = \begin{bmatrix} l_u(\mathcal{V}) \\ l_u(\mathcal{E} \backslash \mathcal{V}) \end{bmatrix}$. Since $l_u(\mathcal{V})$ is fixed on $\mathbf{Y}[u, \mathcal{V}]$, we are only interested in $l_u(\mathcal{E} \backslash \mathcal{V})$. We denote $\mathbf{P} = \mathbf{D}_u^{-1} \mathbf{A}_u$ (the subscript $u$ is omitted from $\mathbf{P}$ for ease of notation), and partition the matrix $\mathbf{P}$ into sub-matrices according to the partition of $l_u(\mathcal{E})$:
$$\mathbf{P} = \begin{bmatrix} \mathbf{P}_{\mathcal{V}\mathcal{V}} & \mathbf{P}_{\mathcal{V}\mathcal{E}} \\ \mathbf{P}_{\mathcal{E}\mathcal{V}} & \mathbf{P}_{\mathcal{E}\mathcal{E}} \end{bmatrix},$$
where the subscript $\mathcal{E}$ is short for $\mathcal{E} \backslash \mathcal{V}$. Then the label propagation scheme is equivalent to
$$l_u(\mathcal{E} \backslash \mathcal{V}) \leftarrow \mathbf{P}_{\mathcal{E}\mathcal{V}}\, \mathbf{Y}[u, \mathcal{V}]^\top + \mathbf{P}_{\mathcal{E}\mathcal{E}}\, l_u(\mathcal{E} \backslash \mathcal{V}). \qquad (5)$$
Repeating the above procedure, we have
$$l_u(\mathcal{E} \backslash \mathcal{V}) = \lim_{n \to \infty} \left[ (\mathbf{P}_{\mathcal{E}\mathcal{E}})^n\, l_u^{(0)}(\mathcal{E} \backslash \mathcal{V}) + \sum_{i=0}^{n-1} (\mathbf{P}_{\mathcal{E}\mathcal{E}})^i\, \mathbf{P}_{\mathcal{E}\mathcal{V}}\, \mathbf{Y}[u, \mathcal{V}]^\top \right], \qquad (6)$$
where $l_u^{(0)}(\mathcal{E} \backslash \mathcal{V})$ is the initial value for $l_u(\mathcal{E} \backslash \mathcal{V})$. Now we show that $\lim_{n \to \infty} (\mathbf{P}_{\mathcal{E}\mathcal{E}})^n\, l_u^{(0)}(\mathcal{E} \backslash \mathcal{V}) = \mathbf{0}$. Since $\mathbf{P}$ is row-normalized and $\mathbf{P}_{\mathcal{E}\mathcal{E}}$ is a sub-matrix of $\mathbf{P}$, we have
$$\sum_j \left[\mathbf{P}_{\mathcal{E}\mathcal{E}}\right]_{ij} \leq 1$$
for all possible row indices $i$. Therefore,
$$\sum_j \left[(\mathbf{P}_{\mathcal{E}\mathcal{E}})^n\right]_{ij} = \sum_k \left[(\mathbf{P}_{\mathcal{E}\mathcal{E}})^{n-1}\right]_{ik} \sum_j \left[\mathbf{P}_{\mathcal{E}\mathcal{E}}\right]_{kj} \leq \sum_k \left[(\mathbf{P}_{\mathcal{E}\mathcal{E}})^{n-1}\right]_{ik} \leq \dots \leq 1,$$
so the row sums of $(\mathbf{P}_{\mathcal{E}\mathcal{E}})^n$ are non-increasing in $n$. As $n$ goes to infinity, the row sum of $(\mathbf{P}_{\mathcal{E}\mathcal{E}})^n$ converges to zero (assuming every non-item entity is connected, possibly indirectly, to at least one item), which implies that $(\mathbf{P}_{\mathcal{E}\mathcal{E}})^n\, l_u^{(0)}(\mathcal{E} \backslash \mathcal{V}) \to \mathbf{0}$. It is clear that the choice of initial value $l_u^{(0)}(\mathcal{E} \backslash \mathcal{V})$ does not affect the convergence. ∎
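The two-step scheme of Theorem 2 is easy to check numerically. Below is a minimal NumPy sketch on a toy four-entity graph (the adjacency matrix, entity indices, and labels are illustrative, not from the paper):

```python
import numpy as np

# Toy graph: entities 0 and 1 are items with fixed labels; 2 and 3 are
# non-item entities. A includes self-connections, as in our model.
A = np.array([[1., 1., 1., 0.],
              [1., 1., 0., 1.],
              [1., 0., 1., 1.],
              [0., 1., 1., 1.]])
P = A / A.sum(axis=1, keepdims=True)   # P = D^{-1} A, row-normalized

y_items = np.array([1., 0.])           # initial labels Y[u, V] of the items
l = np.zeros(4)
l[:2] = y_items

for _ in range(200):
    l = P @ l                          # step 1: propagate labels
    l[:2] = y_items                    # step 2: reset item labels

# At convergence, each non-item label equals the degree-weighted average
# of its neighbors' labels -- the harmonic property of Theorem 1.
for i in (2, 3):
    assert abs(l[i] - A[i] @ l / A[i].sum()) < 1e-9
```

On this toy graph the iteration converges to $l = [1, 0, 2/3, 1/3]$, and the choice of initial values for the non-item entries does not change the result, matching Theorem 2.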
Theorem 2 provides a way of reaching the minimum-energy relevancy label function $l_u^*$. However, $l_u^*$ does not provide any signal for updating the edge-weight matrix $\mathbf{A}_u$, since the labeled part of $l_u^*$, i.e., $l_u^*(\mathcal{V})$, equals the true relevancy labels $\mathbf{Y}[u, \mathcal{V}]$; moreover, we do not know the true relevancy labels for the unlabeled nodes $\mathcal{E} \backslash \mathcal{V}$.
To solve this issue, we propose minimizing the leave-one-out loss (Zhang and Lee, 2007). Suppose we hold out a single item $v$ and treat it as unlabeled. Then we predict its label using the rest of the (labeled) items and the (unlabeled) non-item entities. The prediction process is identical to label propagation in Theorem 2, except that the label of item $v$ is hidden and needs to be calculated. This way, the difference between the true relevancy label of $v$ (i.e., $y_{uv}$) and the predicted label $\hat{l}_u(v)$ serves as a supervised signal for regularizing the edge weights:
$$R(\mathbf{A}) = \sum_u R(\mathbf{A}_u) = \sum_u \sum_v J\left(y_{uv}, \hat{l}_u(v)\right), \qquad (7)$$
where $J$ is the cross-entropy loss function. Given the regularization in Eq. (7), an ideal edge-weight matrix $\mathbf{A}_u$ should reproduce the true relevancy label of each held-out item while also satisfying the smoothness of relevancy labels.
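The leave-one-out computation can be sketched as follows, again on an illustrative toy graph; the held-out item, weights, and labels are hypothetical, and the cross-entropy here is the single-term contribution of one held-out item to the regularizer of Eq. (7):

```python
import numpy as np

def propagate(P, labeled, labels, n, iters=300):
    """Label propagation of Theorem 2: propagate, then reset labeled nodes."""
    l = np.zeros(n)
    l[labeled] = labels
    for _ in range(iters):
        l = P @ l
        l[labeled] = labels
    return l

# Toy KG: items 0, 1, 2 and non-item entities 3, 4 (weights illustrative).
A = np.array([[1., 0., 0., 1., 0.],
              [0., 1., 0., 0., 1.],
              [0., 0., 1., 1., 1.],
              [1., 0., 1., 1., 0.],
              [0., 1., 1., 0., 1.]])
P = A / A.sum(axis=1, keepdims=True)   # row-normalized D^{-1} A
y = np.array([1., 0., 1.])             # true relevancy labels of the items

# Leave-one-out: hold out item 2, predict its label from items 0 and 1.
held_out, rest = 2, [0, 1]
l = propagate(P, rest, y[rest], n=5)
l_hat = l[held_out]                    # predicted relevancy of held-out item

# Cross-entropy J(y_uv, l_hat) between true and predicted labels; summed
# over users and held-out items, this gives the regularizer R(A).
J = -(y[held_out] * np.log(l_hat) + (1 - y[held_out]) * np.log(1 - l_hat))
```

Since the held-out item sits between a positive and a negative item in this toy graph, its predicted label converges to 0.5, and the resulting cross-entropy penalizes edge weights that fail to reproduce the true label.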
4.3. The Unified Loss Function
Combining knowledge-aware graph neural networks and LS regularization, we reach the following complete loss function:
$$\min_{\mathbf{W}, \mathbf{A}} \mathcal{L} = \sum_{u, v} J\left(y_{uv}, \hat{y}_{uv}\right) + \lambda R(\mathbf{A}) + \gamma \|\mathcal{F}\|_2^2, \qquad (8)$$
where $\|\mathcal{F}\|_2^2$ is the L2-regularizer, and $\lambda$ and $\gamma$ are balancing hyper-parameters. In Eq. (8), the first term corresponds to the part of the GNN that learns the transformation matrices $\mathbf{W}$ and the edge weights $\mathbf{A}$ simultaneously, while the second term corresponds to the label smoothness part, which can be seen as adding a constraint on the edge weights $\mathbf{A}$. Therefore, $R(\mathbf{A})$ serves as regularization on $\mathbf{A}$ to assist the GNN in learning the edge weights.
It is also worth noting that the first term can be seen as feature propagation on the KG, while the second term can be seen as label propagation on the KG. A recommender for a specific user $u$ is actually a mapping from item features to user-item interaction labels, i.e., $\hat{y}_{uv} = \mathcal{F}_u(\mathbf{E}_v)$, where $\mathbf{E}_v$ is the raw feature vector of item $v$. Therefore, Eq. (8) utilizes the structural information of the KG on both the feature side and the label side of $\mathcal{F}_u$ to capture users' higher-order preferences.
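To make the feature-propagation side concrete, here is a minimal NumPy sketch of the layer-wise propagation and prediction of Section 4.1 (the graph, weight matrices, and user vector are illustrative toys, not the paper's implementation; we assume symmetric normalization and a sigmoid over an inner product as the prediction function $f$):

```python
import numpy as np

def kgnn_layer(A_u, H, W):
    """One GNN layer: symmetrically normalize the user-specific adjacency,
    aggregate neighbor representations, transform, and apply ReLU."""
    d_inv_sqrt = 1.0 / np.sqrt(A_u.sum(axis=1))          # D_u^{-1/2}
    A_norm = d_inv_sqrt[:, None] * A_u * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
n, d0, d1 = 4, 3, 8                    # entities, raw / hidden feature dims

# User-specific adjacency with self-connections (A_u + I); off-diagonal
# entries stand in for learned relation scores s_u(r).
A_u = np.array([[1.0, 0.8, 0.0, 0.0],
                [0.8, 1.0, 0.3, 0.0],
                [0.0, 0.3, 1.0, 0.5],
                [0.0, 0.0, 0.5, 1.0]])
E = rng.normal(size=(n, d0))           # raw entity features, H_0 = E
W0 = rng.normal(size=(d0, d1))         # layer-specific trainable weights
W1 = rng.normal(size=(d1, d1))

H = kgnn_layer(A_u, kgnn_layer(A_u, E, W0), W1)   # L = 2 layers

# Predicted engagement probability of user u with item v = entity 0,
# using inner product followed by a sigmoid.
u_vec = rng.normal(size=d1)
y_hat = 1.0 / (1.0 + np.exp(-(H[0] @ u_vec)))
```

In training, the gradients of the loss in Eq. (8) would flow through `y_hat` into the weight matrices and, via the entries of `A_u`, into the relation scoring function.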
4.4. Discussion
How can the knowledge graph help find users' interests? To intuitively understand the role of the KG, we make an analogy with a physical equilibrium model as shown in Figure 2. Each entity/item is seen as a particle, while the supervised positive user-relevancy signal acts as a force pulling the observed positive items up from the decision boundary, and the negative signal acts as a force pushing the unobserved items down. Without the KG (Figure 2(a)), these items are only loosely connected with each other through the collaborative filtering effect (which is not drawn here for clarity). In contrast, edges in the KG serve as rubber bands that impose explicit constraints on connected entities. When the number of layers is $L = 1$ (Figure 2(b)), the representation of each entity is a mixture of itself and its immediate neighbors; therefore, optimizing on the positive items will simultaneously pull their immediate neighbors up together. The upward force goes deeper into the KG as $L$ increases (Figure 2(c)), which helps explore users' long-distance interests and pull up more positive items. It is also interesting to note that the proximity constraint exerted by the KG is personalized, since the strength of the rubber band (i.e., $A_u^{ij} = s_u(r_{e_i, e_j})$) is user-specific and relation-specific: one user may prefer relation $r_1$ (Figure 2(b)) while another user (with the same observed items but different unobserved items) may prefer relation $r_2$ (Figure 2(d)).

Despite the force exerted by edges in the KG, edge weights may be set inappropriately, for example, too small to pull up the unobserved items (i.e., the rubber bands are too weak). Next, we show in Figure 2(e) how the label smoothness assumption helps regularize the learning of edge weights. Suppose we hold out the positive sample in the upper left and intend to reproduce its label from the rest of the items. Since the true relevancy label of the held-out sample is 1 and the upper right sample has the largest label value, the LS regularization term would enforce the edges marked with arrows to be large so that the label can "flow" from the blue item to the striped one as much as possible. As a result, this will tighten the rubber bands (denoted by arrows) and encourage the model to pull up the two upper pink items to a greater extent.
Table 2. Statistics of the four datasets.

  Movie  Book  Music  Restaurant
# users  138,159  19,676  1,872  2,298,698 
# items  16,954  20,003  3,846  1,362 
# interactions  13,501,622  172,576  42,346  23,416,418 
# entities  102,569  25,787  9,366  28,115 
# relations  32  18  60  7 
# KG triples  499,474  60,787  15,518  160,519 
5. Experiments
In this section, we evaluate the proposed KGNN-LS model and present its performance on four real-world scenarios: movie, book, music, and restaurant recommendations.
5.1. Datasets
We utilize the following four datasets in our experiments for movie, book, music, and restaurant recommendations, respectively, in which the first three are public datasets and the last one is from Meituan-Dianping Group. We use Satori^2, a commercial KG built by Microsoft, to construct sub-KGs for the MovieLens-20M, Book-Crossing, and Last.FM datasets. The KG for the Dianping-Food dataset is constructed by the internal toolkit of Meituan-Dianping Group. Further details of the datasets are provided in Appendix A.

^2 https://searchengineland.com/library/bing/bing-satori

MovieLens-20M^3 is a widely used benchmark dataset for movie recommendation, which consists of approximately 20 million explicit ratings (ranging from 1 to 5) from the MovieLens website. The corresponding KG contains 102,569 entities, 499,474 edges, and 32 relation-types.

Book-Crossing^4 contains 1 million ratings (ranging from 0 to 10) of books in the Book-Crossing community. The corresponding KG contains 25,787 entities, 60,787 edges, and 18 relation-types.

Last.FM^5 contains musician listening information from a set of 2 thousand users of the Last.fm online music system. The corresponding KG contains 9,366 entities, 15,518 edges, and 60 relation-types.

Dianping-Food is provided by Dianping.com^6 and contains over 10 million interactions (including clicking, buying, and adding to favorites) between approximately 2 million users and 1 thousand restaurants. The corresponding KG contains 28,115 entities, 160,519 edges, and 7 relation-types.

^3 https://grouplens.org/datasets/movielens/
^4 http://www2.informatik.uni-freiburg.de/~cziegler/BX/
^5 https://grouplens.org/datasets/hetrec2011/
^6 https://www.dianping.com/
The statistics of the four datasets are shown in Table 2.
Table 3. The results of Recall@K in top-K recommendation.

Model  MovieLens-20M  Book-Crossing  Last.FM  Dianping-Food

R@2  R@10  R@50  R@100  R@2  R@10  R@50  R@100  R@2  R@10  R@50  R@100  R@2  R@10  R@50  R@100  
SVD  0.036  0.124  0.277  0.401  0.027  0.046  0.077  0.109  0.029  0.098  0.240  0.332  0.039  0.152  0.329  0.451 
LibFM  0.039  0.121  0.271  0.388  0.033  0.062  0.092  0.124  0.030  0.103  0.263  0.330  0.043  0.156  0.332  0.448 
LibFM + TransE  0.041  0.125  0.280  0.396  0.037  0.064  0.097  0.130  0.032  0.102  0.259  0.326  0.044  0.161  0.343  0.455 
PER  0.022  0.077  0.160  0.243  0.022  0.041  0.064  0.070  0.014  0.052  0.116  0.176  0.023  0.102  0.256  0.354 
CKE  0.034  0.107  0.244  0.322  0.028  0.051  0.079  0.112  0.023  0.070  0.180  0.296  0.034  0.138  0.305  0.437 
RippleNet  0.045  0.130  0.278  0.447  0.036  0.074  0.107  0.127  0.032  0.101  0.242  0.336  0.040  0.155  0.328  0.440 
KGNNLS  0.043  0.155  0.321  0.458  0.045  0.082  0.117  0.149  0.044  0.122  0.277  0.370  0.047  0.170  0.340  0.487 
Table 4. The results of AUC in CTR prediction.

Model  Movie  Book  Music  Restaurant

SVD  0.963  0.672  0.769  0.838 
LibFM  0.959  0.691  0.778  0.837 
LibFM + TransE  0.966  0.698  0.777  0.839 
PER  0.832  0.617  0.633  0.746 
CKE  0.924  0.677  0.744  0.802 
RippleNet  0.960  0.727  0.770  0.833 
KGNNLS  0.979  0.744  0.803  0.850 
5.2. Baselines
We compare the proposed KGNN-LS model with the following baselines for recommender systems, in which the first two baselines are KG-free while the rest are all KG-aware methods. The hyper-parameter settings of KGNN-LS are provided in Appendix B.

SVD (Koren, 2008) is a classic CF-based model using inner product to model user-item interactions. We use the unbiased version (i.e., the predicted engagement probability is modeled as $\hat{y}_{uv} = \mathbf{u}^\top \mathbf{v}$). The dimension and learning rate are tuned per dataset, with one setting for MovieLens-20M and Book-Crossing, one for Last.FM, and one for Dianping-Food.

LibFM (Rendle, 2012) is a widely used feature-based factorization model. LibFM + TransE extends LibFM by attaching an entity representation learned by TransE (Bordes et al., 2013) to each user-item pair. The TransE embedding dimension is the same for all datasets.

PER (Yu et al., 2014) is a representative of path-based methods, which treats the KG as a heterogeneous information network and extracts meta-path-based features to represent the connectivity between users and items. We use manually designed "user-item-attribute-item" meta-paths, i.e., "user-movie-director-movie", "user-movie-genre-movie", and "user-movie-star-movie" for MovieLens-20M; "user-book-author-book" and "user-book-genre-book" for Book-Crossing; "user-musician-date_of_birth-musician" (date of birth is discretized), "user-musician-country-musician", and "user-musician-genre-musician" for Last.FM; and "user-restaurant-dish-restaurant", "user-restaurant-business_area-restaurant", and "user-restaurant-tag-restaurant" for Dianping-Food. The settings of dimension and learning rate are the same as for SVD.

CKE (Zhang et al., 2016) is a representative of embedding-based methods, which combines CF with structural, textual, and visual knowledge in a unified framework. We implement CKE as CF plus a structural knowledge module in this paper. The embedding dimension is tuned per dataset, the training weight for the KG part is the same for all datasets, and the learning rates are the same as in SVD.

RippleNet (Wang et al., 2018c) is a representative of hybrid methods; it is a memory-network-like approach that propagates users' preferences over the KG for recommendation. The hyper-parameters of RippleNet are tuned separately for MovieLens-20M, Last.FM, and Dianping-Food.
5.3. Validating the Connection between $\mathcal{G}$ and $\mathbf{Y}$
To validate the connection between the knowledge graph $\mathcal{G}$ and user-item interactions $\mathbf{Y}$, we conduct an empirical study investigating the correlation between the shortest-path distance of two randomly sampled items in the KG and whether they have common user(s) in the dataset, i.e., whether there exists at least one user who interacted with both items. For MovieLens-20M and Last.FM, we randomly sample ten thousand item pairs that have no common users and ten thousand that have at least one common user, then count the distribution of their shortest-path distances in the KG. The results are presented in Figure 3, which clearly shows that if two items have common user(s) in the dataset, they are likely to be closer in the KG. For example, two movies with common user(s) in MovieLens-20M are much more likely to be within 2 hops of each other in the KG than two movies with no common user. This finding empirically demonstrates that exploiting the proximity structure of the KG can assist in making recommendations, and justifies our motivation to use label smoothness regularization to help learn entity representations.
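The sampling procedure above can be sketched with a plain BFS over the KG; the adjacency list below is an illustrative toy, whereas the actual study uses the MovieLens-20M and Last.FM KGs:

```python
import random
from collections import deque

def shortest_path_dist(adj, src, dst, cutoff=10):
    """BFS shortest-path distance between two entities; None if beyond cutoff."""
    if src == dst:
        return 0
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d >= cutoff:
            continue
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

# Toy undirected KG as an adjacency list over entity ids.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}

# Sample random item pairs and tally the distribution of their distances.
random.seed(0)
items = [0, 1, 2, 3]
counts = {}
for _ in range(1000):
    a, b = random.sample(items, 2)
    d = shortest_path_dist(adj, a, b)
    counts[d] = counts.get(d, 0) + 1
```

In the real study, one such distance histogram is computed for pairs with common users and another for pairs without, and the two distributions are compared.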
5.4. Results
5.4.1. Comparison with Baselines
We evaluate our method in two experiment scenarios: (1) In top-$K$ recommendation, we use the trained model to select the $K$ items with the highest predicted click probability for each user in the test set, and choose Recall@$K$ to evaluate the recommended sets. (2) In click-through rate (CTR) prediction, we apply the trained model to predict each user-item pair in the test set (including positive items and randomly selected negative items). We use AUC as the evaluation metric in CTR prediction.
The results of top-$K$ recommendation and CTR prediction are presented in Tables 3 and 4, respectively, which show that KGNN-LS outperforms the baselines by a significant margin. For example, the AUC of KGNN-LS surpasses the best baseline in each of the MovieLens-20M, Book-Crossing, Last.FM, and Dianping-Food datasets.
We also show the daily performance of KGNN-LS and the baselines on Dianping-Food to investigate performance stability. Figure 6 shows their performance from September 1, 2018 to September 30, 2018. We notice that the curve of KGNN-LS is consistently above the baselines over the test period; moreover, the performance of KGNN-LS has low variance, which suggests that KGNN-LS is robust and stable in practice.
5.4.2. Effectiveness of LS Regularization
Is the proposed LS regularization helpful in improving the performance of the GNN? To study the effectiveness of LS regularization, we fix the dimensions of the hidden layers and vary $\lambda$ to see how performance changes. The results of R@10 on the Last.FM dataset are plotted in Figure 6. It is clear that the performance of KGNN-LS with a non-zero $\lambda$ is better than with $\lambda = 0$ (the case of Wang et al. (Wang et al., 2019c)), which justifies our claim that LS regularization can assist in learning the edge weights of a KG and achieve better generalization in recommender systems. But note that a too-large $\lambda$ is less favorable, since it overwhelms the overall loss and misleads the direction of gradients. According to the experimental results, we find that a moderate value of $\lambda$ is preferable in most cases.
5.4.3. Results in Cold-start Scenarios
Table 5. AUC w.r.t. the ratio of the training set on MovieLens-20M.

Model  20%  40%  60%  80%  100%
SVD  0.882  0.913  0.938  0.955  0.963

LibFM  0.902  0.923  0.938  0.950  0.959 
LibFM+TransE  0.914  0.935  0.949  0.960  0.966 
PER  0.802  0.814  0.821  0.828  0.832 
CKE  0.898  0.910  0.916  0.921  0.924 
RippleNet  0.921  0.937  0.947  0.955  0.960 
KGNNLS  0.961  0.970  0.974  0.977  0.979 
One major goal of using KGs in recommender systems is to alleviate the sparsity issue. To investigate the performance of KGNN-LS in cold-start scenarios, we vary the size of the training set of MovieLens-20M from $r = 100\%$ down to $r = 20\%$ (while the validation and test sets are kept fixed), and report the results of AUC in Table 5. When $r = 20\%$, the AUC decreases noticeably for all six baselines compared to the model trained on the full training data ($r = 100\%$), but the decrease for KGNN-LS is the smallest. This demonstrates that KGNN-LS maintains predictive performance even when user-item interactions are sparse.
5.4.4. Hyperparameter Sensitivity
We first analyze the sensitivity of KGNN-LS to the number of GNN layers. We vary the number of layers from 1 to 4 while keeping other hyperparameters fixed. The results are shown in Table 6. We find that the model performs poorly with 4 layers, because a deeper model mixes too many entity embeddings into a given entity, which over-smoothes the representation learning on KGs. KGNN-LS achieves the best performance with 1 or 2 layers on the four datasets.
Table 6: Performance of KGNN-LS with different numbers of GNN layers.

Dataset  1  2  3  4
MovieLens-20M  0.155  0.146  0.122  0.011
Book-Crossing  0.077  0.082  0.043  0.008
Last.FM  0.122  0.106  0.105  0.057
Dianping-Food  0.165  0.170  0.061  0.036
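The sharp drop at 4 layers is consistent with over-smoothing, which can be demonstrated with a toy sketch (illustrative only, not our model: plain mean aggregation without trainable weights or nonlinearities). Repeated neighborhood averaging drives all node embeddings toward a common vector, erasing the distinctions a recommender needs.

```python
import numpy as np

def aggregate(A, H):
    """One mean-aggregation GNN layer (self + neighbors), no nonlinearity."""
    A_hat = A + np.eye(len(A))                   # add self-loops
    P = A_hat / A_hat.sum(axis=1, keepdims=True) # row-normalize
    return P @ H

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)        # a small connected graph
H = rng.normal(size=(4, 8))                      # initial node embeddings

spread = []
for _ in range(4):                               # stack 4 layers
    H = aggregate(A, H)
    spread.append(np.std(H, axis=0).mean())      # how distinguishable nodes remain
```

The recorded `spread` shrinks as layers are stacked: after four rounds of averaging the node embeddings are far closer to one another than after one, mirroring the degradation observed in Table 6.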
We also examine the impact of the dimension d of hidden layers on the performance of KGNN-LS. The results are shown in Table 7. We observe that performance improves as d increases at first, because more bits in the hidden layers can improve model capacity. However, performance drops when d increases further, since a too large dimension may overfit the datasets. The best performance is achieved with a moderate d, ranging from 8 to 64 across the four datasets.
Table 7: Performance of KGNN-LS with different dimensions d of hidden layers.

Dataset  4  8  16  32  64  128
MovieLens-20M  0.134  0.141  0.143  0.155  0.155  0.151
Book-Crossing  0.065  0.073  0.077  0.081  0.082  0.080
Last.FM  0.111  0.116  0.122  0.109  0.102  0.107
Dianping-Food  0.155  0.170  0.167  0.166  0.163  0.161
5.5. Running Time Analysis
We also investigate the running time of our method with respect to the size of the KG. We run experiments on a Microsoft Azure virtual machine with 1 NVIDIA Tesla M60 GPU, 12 Intel Xeon CPUs (E5-2690 v3 @2.60GHz), and 128GB of RAM. The size of the KG is increased up to five times its original size by extracting more triples from Satori, and the running times of all methods on MovieLens-20M are reported in Figure 6. Note that the trend of a curve matters more than the absolute values, since the values depend heavily on the mini-batch size and the number of epochs (though we did try to align the configurations of all methods). The results show that KGNN-LS exhibits strong scalability even when the KG is large.
6. Conclusion and Future Work
In this paper, we propose knowledge-aware graph neural networks with label smoothness regularization for recommendation. KGNN-LS applies a GNN architecture to KGs by using user-specific relation scoring functions and aggregating neighborhood information with different weights. In addition, the proposed label smoothness constraint and leave-one-out loss provide strong regularization for learning the edge weights in KGs. We also discuss how KGs benefit recommender systems and how label smoothness assists in learning the edge weights. Experimental results show that KGNN-LS outperforms state-of-the-art baselines in four recommendation scenarios and achieves desirable scalability with respect to KG size.
In this paper, LS regularization is proposed for the recommendation task with KGs. It would be interesting to examine the LS assumption on other graph tasks such as link prediction and node classification. Investigating the theoretical relationship between feature propagation and label propagation is also a promising direction.
Acknowledgements
This research has been supported in part by NSF OAC-1835598, DARPA MCS, ARO MURI, Boeing, Docomo, Hitachi, Huawei, JD, Siemens, and the Stanford Data Science Initiative.
References
 Baluja et al. (2008) Shumeet Baluja, Rohan Seth, D Sivakumar, Yushi Jing, Jay Yagnik, Shankar Kumar, Deepak Ravichandran, and Mohamed Aly. 2008. Video suggestion and discovery for youtube: taking random walks through the view graph. In Proceedings of the 17th international conference on World Wide Web. ACM, 895–904.
 Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems. 2787–2795.
 Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. 2014. Spectral networks and locally connected networks on graphs. In the 2nd International Conference on Learning Representations.
 Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems. ACM, 191–198.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
 Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems. 2224–2232.
 Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.

 Hu et al. (2018) Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S Yu. 2018. Leveraging meta-path based context for top-N recommendation with a neural co-attention model. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1531–1540.
 Huang et al. (2018) Jin Huang, Wayne Xin Zhao, Hongjian Dou, Ji-Rong Wen, and Edward Y Chang. 2018. Improving Sequential Recommendation with Knowledge-Enhanced Memory Networks. In the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 505–514.
 Karasuyama and Mamitsuka (2013) Masayuki Karasuyama and Hiroshi Mamitsuka. 2013. Manifold-based similarity adaptation for label propagation. In Advances in Neural Information Processing Systems. 1547–1555.
 Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In the 5th International Conference on Learning Representations.
 Koren (2008) Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
 Monti et al. (2017) Federico Monti, Michael Bronstein, and Xavier Bresson. 2017. Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems. 3697–3707.

 Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning. 2014–2023.
 Rendle (2012) Steffen Rendle. 2012. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST) 3, 3 (2012), 57.
 Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference. Springer, 593–607.
 Sun et al. (2018) Zhu Sun, Jie Yang, Jie Zhang, Alessandro Bozzon, Long-Kai Huang, and Chi Xu. 2018. Recurrent knowledge graph embedding for effective recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 297–305.
 van den Berg et al. (2017) Rianne van den Berg, Thomas N Kipf, and Max Welling. 2017. Graph Convolutional Matrix Completion. stat 1050 (2017), 7.
 Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations.
 Wang and Zhang (2008) Fei Wang and Changshui Zhang. 2008. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering 20, 1 (2008), 55–67.
 Wang et al. (2017b) Hongwei Wang, Jia Wang, Miao Zhao, Jiannong Cao, and Minyi Guo. 2017b. Joint topic-semantic-aware social recommendation for online voting. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 347–356.
 Wang et al. (2018b) Hongwei Wang, Fuzheng Zhang, Min Hou, Xing Xie, Minyi Guo, and Qi Liu. 2018b. Shine: Signed heterogeneous information network embedding for sentiment link prediction. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 592–600.
 Wang et al. (2018c) Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2018c. RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 417–426.
 Wang et al. (2019a) Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019a. Exploring High-Order User Preference on the Knowledge Graph for Recommender Systems. ACM Transactions on Information Systems (TOIS) 37, 3 (2019), 32.
 Wang et al. (2018d) Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018d. DKN: Deep Knowledge-Aware Network for News Recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. 1835–1844.
 Wang et al. (2019b) Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019b. Multi-Task Feature Learning for Knowledge Graph Enhanced Recommendation. In Proceedings of the 2019 World Wide Web Conference on World Wide Web.
 Wang et al. (2019c) Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, and Minyi Guo. 2019c. Knowledge graph convolutional networks for recommender systems. In Proceedings of the 2019 World Wide Web Conference on World Wide Web.
 Wang et al. (2018a) Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. 2018a. Billion-scale commodity embedding for e-commerce recommendation in Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 839–848.
 Wang et al. (2017a) Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017a. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29, 12 (2017), 2724–2743.

 Wu et al. (2018) Yuexin Wu, Hanxiao Liu, and Yiming Yang. 2018. Graph Convolutional Matrix Completion for Bipartite Edge Prediction. In the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management.
 Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 974–983.
 Yu et al. (2014) Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. 2014. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. ACM, 283–292.
 Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 353–362.
 Zhang and Lee (2007) Xinhua Zhang and Wee S Lee. 2007. Hyperparameter learning for graph-based semi-supervised learning algorithms. In Advances in Neural Information Processing Systems. 1585–1592.
 Zhao et al. (2017) Huan Zhao, Quanming Yao, Jianda Li, Yangqiu Song, and Dik Lun Lee. 2017. Meta-graph based recommendation fusion over heterogeneous information networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 635–644.
 Zhou et al. (2004) Dengyong Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. 2004. Learning with local and global consistency. In Advances in Neural Information Processing Systems. 321–328.
 Zhu et al. (2003) Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning. 912–919.
 Zhu et al. (2005) Xiaojin Zhu, John Lafferty, and Ronald Rosenfeld. 2005. Semi-supervised learning with graphs. Ph.D. Dissertation. Carnegie Mellon University, Language Technologies Institute, School of Computer Science.
Appendix
A Additional Details on Datasets
MovieLens-20M, Book-Crossing, and Last.FM datasets contain explicit feedback (Last.FM provides the listening count as the weight for each user-item interaction). Therefore, we transform them into implicit feedback, where each entry marked with 1 indicates that the user has rated the item positively. The threshold for a positive rating is 4 for MovieLens-20M, while no threshold is set for Book-Crossing and Last.FM due to their sparsity. Additionally, for each user we randomly sample a set of unwatched items and mark them as 0; the number of sampled negatives equals the number of his/her positively-rated items.
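The conversion to implicit feedback can be sketched as follows (an illustrative snippet; the function name and data layout are ours, while the threshold-4 rule and the 1:1 negative sampling follow the text):

```python
import random

def to_implicit(ratings, all_items, threshold=4, seed=0):
    """ratings: list of (user, item, score) triples.
    Returns (user, item, label) samples: label 1 for scores >= threshold,
    plus an equal number of randomly sampled negatives per user."""
    rng = random.Random(seed)
    positives = {}                                   # user -> set of liked items
    for user, item, score in ratings:
        if score >= threshold:
            positives.setdefault(user, set()).add(item)
    samples = []
    for user, items in positives.items():
        for item in items:
            samples.append((user, item, 1))
        # negatives: items the user has not positively rated
        candidates = [i for i in all_items if i not in items]
        for item in rng.sample(candidates, min(len(items), len(candidates))):
            samples.append((user, item, 0))
    return samples
```

For Book-Crossing and Last.FM, the same routine would be called with a threshold low enough to keep every rated item as a positive, per the no-threshold rule above.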
We use Microsoft Satori to construct the KGs for the MovieLens-20M, Book-Crossing, and Last.FM datasets. In a Satori triple, the head and tail are either IDs or textual content, and the relation is of the form "domain.head_category.tail_category" (e.g., "book.book.author"). We first select a subset of triples from the whole Satori KG with a confidence level greater than 0.9. Given this sub-KG, we collect the Satori IDs of all valid movies/books/musicians by matching their names with the tails of triples (head, film.film.name, tail), (head, book.book.title, tail), or (head, type.object.name, tail) for the three datasets, respectively. Items with multiple matched entities or no matched entity are excluded for simplicity. After obtaining the set of item IDs, we match these IDs against the heads of all triples in the Satori sub-KG, and select all well-matched triples as the final KG for each dataset.
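The name-matching step above can be sketched as follows (an illustrative helper of our own; only the triple format and the rule of dropping ambiguous or missing matches come from the text):

```python
def match_items(item_names, name_triples):
    """Map each item name to a Satori entity ID via (head, name_relation, tail)
    triples; items with zero or multiple matching entities are dropped."""
    name_to_heads = {}
    for head, _rel, tail in name_triples:
        name_to_heads.setdefault(tail, set()).add(head)
    matched = {}
    for name in item_names:
        heads = name_to_heads.get(name, set())
        if len(heads) == 1:            # keep only unambiguous matches
            matched[name] = next(iter(heads))
    return matched
```

An item whose title matches two different entities (or none at all) is simply excluded, mirroring the exclusion rule described above.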
The Dianping-Food dataset is collected from Dianping.com, a Chinese group-buying website hosting consumer reviews of restaurants, similar to Yelp. We select approximately 10 million interactions between users and restaurants on Dianping.com from May 1, 2015 to December 12, 2018. The types of positive interactions include clicking, buying, and adding to favorites, and we sample negative interactions for each user. The KG for Dianping-Food is collected from Meituan Brain, an internal knowledge graph built for dining and entertainment by Meituan-Dianping Group. The types of entities include POI (restaurant), city, first-level and second-level category, star, business area, dish, and tag; the types of relations correspond to the types of entities (e.g., "organization.POI.has_dish").
B Additional Details on Hyperparameter Searching
In KGNN-LS, we use the inner product for both the relation scoring function and the prediction function, with ReLU as the activation for non-last layers and tanh for the last layer. Note that the number of neighbors of an entity may vary significantly over a KG. To keep computation efficient, we uniformly sample a fixed-size set of neighbors for each entity instead of using its full neighborhood; the number of sampled neighbors per entity is denoted by S. Hyperparameter settings, which were determined by optimizing on a validation set, are given in Table 8.
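The fixed-size neighbor sampling can be sketched as follows (an illustrative snippet, not the released implementation; sampling with replacement for small neighborhoods is one possible choice we assume here):

```python
import random

def sample_neighbors(adj, entity, size, seed=0):
    """Uniformly sample a fixed-size neighbor set, so every entity's
    receptive field has the same shape. Sample with replacement when an
    entity has fewer than `size` neighbors."""
    rng = random.Random(seed)
    neighbors = adj.get(entity, [entity])    # fall back to a self-loop
    if len(neighbors) >= size:
        return rng.sample(neighbors, size)   # without replacement
    return [rng.choice(neighbors) for _ in range(size)]  # with replacement
```

Fixing the sample size S keeps every mini-batch tensor rectangular, which is what makes the GNN computation efficient on GPUs regardless of the degree distribution of the KG.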
For each dataset, we split the data into training, validation, and test sets. Each experiment is repeated multiple times, and the average performance is reported. All trainable parameters are optimized by the Adam algorithm. The code of KGNN-LS is implemented with Python 3.6, TensorFlow 1.12.0, and NumPy 1.14.3.
Table 8: Hyperparameter settings for the four datasets.

Hyperparameter  Movie  Book  Music  Restaurant
S (number of sampled neighbors)  16  8  8  4
d (dimension of hidden layers)  32  64  16  8
number of GNN layers  1  2  1  2
λ (LS regularization weight)  1.0  0.5  0.1  0.5