KGNN-LS
A tensorflow implementation of Knowledge-aware Graph Neural Networks with Label Smoothness regularization
Knowledge graphs capture interlinked information between entities, and they represent an attractive source of structured information that can be harnessed for recommender systems. However, existing recommender engines use knowledge graphs by manually designing features, do not allow for end-to-end training, or provide poor scalability. Here we propose Knowledge Graph Convolutional Networks (KGCN), an end-to-end trainable framework that harnesses item relationships captured by the knowledge graph to provide better recommendations. Conceptually, KGCN computes user-specific item embeddings by first applying a trainable function that identifies important knowledge graph relations for a given user and then transforming the knowledge graph into a user-specific weighted graph. Then, KGCN applies a graph convolutional neural network that computes an embedding of an item node by propagating and aggregating knowledge graph neighborhood information. Moreover, to provide better inductive bias, KGCN uses label smoothness (LS), which provides regularization over edge weights, and we prove that it is equivalent to a label propagation scheme on a graph. Finally, we unify KGCN and LS regularization, and present a scalable minibatch implementation for the KGCN-LS model. Experiments show that KGCN-LS outperforms strong baselines on four datasets. KGCN-LS also achieves strong performance in sparse scenarios and is highly scalable with respect to the knowledge graph size.
Recommender systems are widely used in Internet applications to meet users’ personalized interests and alleviate information overload (Covington et al., 2016; Wang et al., 2018a; Ying et al., 2018). Traditional recommender systems that are based on collaborative filtering (Koren et al., 2009; Wang et al., 2017b) usually suffer from the cold-start problem and have trouble recommending brand new items that have not yet been heavily explored by the users. The sparsity issue can be addressed by introducing additional sources of information such as user/item profiles (Wang et al., 2018b) or social networks (Wang et al., 2017b).
Knowledge graphs (KGs) capture structured information and relations between a set of entities (Zhang et al., 2016; Wang et al., 2018d; Huang et al., 2018; Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018; Wang et al., 2018c; Sun et al., 2018; Wang et al., 2019b, c; Wang et al., 2019a). KGs are heterogeneous graphs in which nodes correspond to entities (e.g., items or products, as well as their properties and characteristics) and edges correspond to relations. KGs provide connectivity information between items via different types of relations and thus capture semantic relatedness between the items.
The core challenge in utilizing KGs in recommender systems is learning the user-specific item-item relatedness encoded in the KG. Existing KG-aware recommender systems can be classified into path-based methods (Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018), embedding-based methods (Zhang et al., 2016; Wang et al., 2018d; Huang et al., 2018; Wang et al., 2019b), and hybrid methods (Wang et al., 2018c; Sun et al., 2018; Wang et al., 2019c). However, these approaches rely on manual feature engineering, are unable to perform end-to-end training, and have poor scalability. Graph Neural Networks (GNNs), which aggregate node feature information from a node’s local network neighborhood using neural networks, represent a promising advancement in graph-based representation learning (Bruna et al., 2014; Defferrard et al., 2016; Kipf and Welling, 2017; Duvenaud et al., 2015; Niepert et al., 2016; Hamilton et al., 2017). Recently, several works developed GNN architectures for recommender systems (Ying et al., 2018; Monti et al., 2017; van den Berg et al., 2017; Wu et al., 2018; Wang et al., 2019c), but these approaches are mostly designed for homogeneous bipartite user-item interaction graphs or user-/item-similarity graphs. It remains an open question how to extend GNN architectures to heterogeneous knowledge graphs.
In this paper, we develop Knowledge-aware Graph Neural Networks with Label Smoothness regularization (KGNN-LS), which extends GNN architectures to knowledge graphs to simultaneously capture semantic relationships between the items as well as personalized user preferences and interests. To account for the relational heterogeneity in KGs, similar to (Wang et al., 2019c), we use a trainable and personalized relation scoring function that transforms the KG into a user-specific weighted graph, which characterizes both the semantic information of the KG and the user’s personalized interests. For example, in the movie recommendation setting the relation scoring function could learn that a given user really cares about the “director” relation between movies and persons, while somebody else may care more about the “lead actor” relation.
Using this personalized weighted graph, we then apply a graph neural network that, for every item node, computes its embedding by aggregating node feature information over the local network neighborhood of the item node. This way the embedding of each item captures its local KG structure in a user-personalized way.
A significant difference between our approach and traditional GNNs is that the edge weights in the graph are not given as input. We set them using a user-specific relation scoring function that is trained in a supervised fashion. However, the added flexibility of edge weights makes the learning process prone to overfitting, since the only source of supervised signal for the relation scoring function comes from user-item interactions (which are sparse in general). To remedy this problem, we develop a technique for regularizing edge weights during the learning process, which leads to better generalization. We develop an approach based on label smoothness (Zhu et al., 2003; Zhang and Lee, 2007), which assumes that adjacent entities in the KG are likely to have similar user relevancy labels/scores. In our context this assumption means that users tend to have similar preferences for items that are nearby in the KG. We prove that label smoothness regularization is equivalent to label propagation, and we design a leave-one-out loss function for label propagation to provide extra supervised signal for learning the edge scoring function. We show that the knowledge-aware graph neural networks and label smoothness regularization can be unified under the same framework, where label smoothness can be seen as a natural choice of regularization on knowledge-aware graph neural networks.
We apply the proposed method to four real-world datasets of movie, book, music, and restaurant recommendations, in which the first three datasets are public datasets and the last is from Meituan-Dianping Group. Experiments show that our method achieves significant gains over state-of-the-art methods in recommendation accuracy. We also show that our method maintains strong recommendation performance in the cold-start scenarios where user-item interactions are sparse.
Graph Neural Networks (or Graph Convolutional Neural Networks, GCNs) aim to generalize convolutional neural networks to non-Euclidean domains (such as graphs) for robust feature learning. Bruna et al. (Bruna et al., 2014) define the convolution in Fourier domain and calculate the eigendecomposition of the graph Laplacian, Defferrard et al. (Defferrard et al., 2016) approximate the convolutional filters by Chebyshev expansion of the graph Laplacian, and Kipf et al. (Kipf and Welling, 2017) propose a convolutional architecture via a first-order approximation. In contrast to these spectral GCNs, non-spectral GCNs operate on the graph directly and apply “convolution” (i.e., weighted average) to local neighbors of a node (Duvenaud et al., 2015; Niepert et al., 2016; Hamilton et al., 2017).
Recently, researchers also deployed GNNs in recommender systems: PinSage (Ying et al., 2018) applies GNNs to the pin-board bipartite graph in Pinterest. Monti et al. (Monti et al., 2017) and Berg et al. (van den Berg et al., 2017) model recommender systems as matrix completion and design GNNs for representation learning on user-item bipartite graphs. Wu et al. (Wu et al., 2018) use GNNs on user/item structure graphs to learn user/item representations. The difference between these works and ours is that they are all designed for homogeneous bipartite graphs or user/item-similarity graphs where GNNs can be used directly, while here we investigate GNNs for heterogeneous KGs. Wang et al. (Wang et al., 2019c) use GCNs in KGs for recommendation, but simply applying GCNs to KGs without proper regularization is prone to overfitting and leads to performance degradation as we will show later. Schlichtkrull et al. also propose using GNNs to model KGs (Schlichtkrull et al., 2018), but not for the purpose of recommendations.
The goal of graph-based semi-supervised learning is to correctly label all nodes in a graph given that only a few nodes are labeled. Prior work often makes assumptions on the distribution of labels over the graph, and one common assumption is smooth variation of labels of nodes across the graph. Based on different settings of edge weights in the input graph, these methods are classified as: (1) Edge weights are assumed to be given as input and therefore fixed (Zhu et al., 2003; Zhou et al., 2004; Baluja et al., 2008); (2) Edge weights are parameterized and therefore learnable (Zhang and Lee, 2007; Wang and Zhang, 2008; Karasuyama and Mamitsuka, 2013). Inspired by these methods, we design a module of label smoothness regularization in our proposed model. The major distinction of our work is that the label smoothness constraint is not used for semi-supervised learning on graphs, but serves as regularization to assist the learning of edge weights and achieves better generalization for recommender systems.
In general, existing KG-aware recommender systems can be classified into three categories: (1) Embedding-based methods (Zhang et al., 2016; Wang et al., 2018d; Huang et al., 2018; Wang et al., 2019b) pre-process a KG with knowledge graph embedding (KGE) (Wang et al., 2017a) algorithms, then incorporate learned entity embeddings into recommendation. Embedding-based methods are highly flexible in utilizing KGs to assist recommender systems, but the KGE algorithms focus more on modeling rigorous semantic relatedness (e.g., TransE (Bordes et al., 2013) assumes $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ for the head, relation, and tail of a triple), which is more suitable for graph applications such as link prediction than for recommendations. In addition, embedding-based methods usually lack an end-to-end way of training. (2) Path-based methods (Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018) explore various patterns of connections among items in a KG (a.k.a. meta-paths or meta-graphs) to provide additional guidance for recommendations. Path-based methods make use of KGs in a more intuitive way, but they rely heavily on manually designed meta-paths/meta-graphs, which are hard to tune in practice. (3) Hybrid methods (Wang et al., 2018c; Sun et al., 2018; Wang et al., 2019c) combine the above two categories and learn user/item embeddings by exploiting the structure of KGs. Our proposed model can be seen as an instance of hybrid methods.
We begin by describing the KG-aware recommendation problem and introducing notation. In a typical recommendation scenario, we have a set of users $\mathcal{U}$ and a set of items $\mathcal{V}$. The user-item interaction matrix $\mathbf{Y}$ is defined according to users’ implicit feedback, where $y_{uv} = 1$ indicates that user $u$ has engaged with item $v$, such as clicking, watching, or purchasing. We also have a knowledge graph $\mathcal{G} = \{(h, r, t)\}$ available, in which $h \in \mathcal{E}$, $r \in \mathcal{R}$, and $t \in \mathcal{E}$ denote the head, relation, and tail of a knowledge triple, and $\mathcal{E}$ and $\mathcal{R}$ are the sets of entities and relations in the knowledge graph, respectively. For example, the triple (The Silence of the Lambs, film.film.star, Anthony Hopkins) states the fact that Anthony Hopkins is the leading actor in the film “The Silence of the Lambs”. In many recommendation scenarios, an item $v \in \mathcal{V}$ corresponds to an entity $e \in \mathcal{E}$ (e.g., item “The Silence of the Lambs” in MovieLens also appears in the knowledge graph as an entity). The set of entities $\mathcal{E}$ is composed of items ($\mathcal{V} \subseteq \mathcal{E}$) as well as non-items $\mathcal{E} \setminus \mathcal{V}$ (e.g., nodes corresponding to item/product properties). Given the user-item interaction matrix $\mathbf{Y}$ and knowledge graph $\mathcal{G}$, our task is to predict whether user $u$ has potential interest in item $v$ with which he/she has not engaged before. Specifically, we aim to learn a prediction function $\hat{y}_{uv} = \mathcal{F}(u, v \mid \Theta, \mathbf{Y}, \mathcal{G})$, where $\hat{y}_{uv}$ denotes the probability that user $u$ will engage with item $v$, and $\Theta$ are model parameters of function $\mathcal{F}$.
We list the key symbols used in this paper in Table 1.
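Symbol | Meaning |
---|---|
$\mathcal{U}$ | Set of users |
$\mathcal{V}$ | Set of items |
$\mathbf{Y}$ | User-item interaction matrix |
$\mathcal{G}$ | Knowledge graph |
$\mathcal{E}$ | Set of entities |
$\mathcal{R}$ | Set of relations |
$\mathcal{E} \setminus \mathcal{V}$ | Set of non-item entities |
$s_u(r)$ | User-specific relation scoring function |
$\mathbf{A}_u$ | Adjacency matrix of $\mathcal{G}$ w.r.t. user $u$ |
$\mathbf{D}_u$ | Diagonal degree matrix of $\mathbf{A}_u$ |
$\mathbf{E}$ | Raw entity feature |
$\mathbf{H}_l$ | Entity representation in the $l$-th layer |
$\mathbf{W}_l$ | Transformation matrix in the $l$-th layer |
$l_u(e)$ | Item relevancy labeling function |
$l_u^*(e)$ | Minimum-energy labeling function |
$\hat{l}_u(v)$ | Predicted relevancy label for item $v$ |
$R(\mathbf{A}_u)$ | Label smoothness regularization on $\mathbf{A}_u$ |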
In this section, we first introduce knowledge-aware graph neural networks and label smoothness regularization, respectively, then we present the unified model.
The first step of our approach is to transform a heterogeneous KG into a user-personalized weighted graph, which characterizes the user’s preferences. To this end, similar to (Wang et al., 2019c), we use a user-specific relation scoring function $s_u(r)$ that provides the importance of relation $r$ for user $u$: $s_u(r) = g(\mathbf{u}, \mathbf{r})$, where $\mathbf{u}$ and $\mathbf{r}$ are feature vectors of user $u$ and relation type $r$, respectively, and $g$ is a differentiable function such as inner product. Intuitively, $s_u(r)$ characterizes the importance of relation $r$ to user $u$. For example, a user may be more interested in movies that have the same director as the movies he/she watched before, but another user may care more about the leading actor of movies.
Given the user-specific relation scoring function $s_u(\cdot)$ of user $u$, knowledge graph $\mathcal{G}$ can therefore be transformed into a user-specific adjacency matrix $\mathbf{A}_u \in \mathbb{R}^{|\mathcal{E}| \times |\mathcal{E}|}$, in which the $(i, j)$-entry $A_u^{ij} = s_u(r_{e_i, e_j})$, and $r_{e_i, e_j}$ is the relation between entities $e_i$ and $e_j$ in $\mathcal{G}$.[1] $A_u^{ij} = 0$ if there is no relation between $e_i$ and $e_j$. See the left two subfigures in Figure 1 for illustration. We also denote the raw feature matrix of entities as $\mathbf{E} \in \mathbb{R}^{|\mathcal{E}| \times d_0}$, where $d_0$ is the dimension of raw entity features. Then we use multiple feed forward layers to update the entity representation matrix by aggregating representations of neighboring entities. Specifically, the layer-wise forward propagation can be expressed as
$$\mathbf{H}_{l+1} = \sigma\left(\mathbf{D}_u^{-1/2} \mathbf{A}_u \mathbf{D}_u^{-1/2} \mathbf{H}_l \mathbf{W}_l\right), \quad l = 0, 1, \cdots, L-1. \tag{1}$$
[1] In this work we treat $\mathcal{G}$ as an undirected graph, so $\mathbf{A}_u$ is a symmetric matrix. If both triples $(h, r_1, t)$ and $(t, r_2, h)$ exist, we only consider one of $r_1$ and $r_2$. This is due to the fact that: (1) $r_1$ and $r_2$ are the inverse of each other and semantically related; (2) Treating $\mathbf{A}_u$ as symmetric will greatly increase the matrix density.
In Eq. (1), $\mathbf{H}_l$ is the matrix of hidden representations of entities in layer $l$, and $\mathbf{H}_0 = \mathbf{E}$. $\mathbf{A}_u$ is used to aggregate representation vectors of neighboring entities. In this paper, we set $\mathbf{A}_u \leftarrow \mathbf{A}_u + \mathbf{I}$, i.e., adding a self-connection to each entity, to ensure that the old representation vector of the entity itself is taken into consideration when updating entity representations. $\mathbf{D}_u$ is a diagonal degree matrix with entries $D_u^{ii} = \sum_j A_u^{ij}$; therefore, $\mathbf{D}_u^{-1/2}$ is used to normalize $\mathbf{A}_u$ and keep the entity representation matrix $\mathbf{H}_l$ stable. $\mathbf{W}_l \in \mathbb{R}^{d_l \times d_{l+1}}$ is the layer-specific trainable weight matrix, $\sigma$ is a non-linear activation function, and $L$ is the number of layers.
A single GNN layer computes the representation of an entity via a transformed mixture of itself and its immediate neighbors in the KG. We can therefore naturally extend the model to multiple layers to explore users’ potential interests in a broader and deeper way. The final output is $\mathbf{H}_L \in \mathbb{R}^{|\mathcal{E}| \times d_L}$, which contains the entity representations that mix the initial features of the entities themselves and their neighbors up to $L$ hops away. Finally, the predicted engagement probability of user $u$ with item $v$ is calculated by $\hat{y}_{uv} = f(\mathbf{u}, \mathbf{v}_u)$, where $\mathbf{v}_u$ (i.e., the $v$-th row of $\mathbf{H}_L$) is the final representation vector of item $v$, and $f$ is a differentiable prediction function, for example, inner product or a multilayer perceptron. Note that $\mathbf{v}_u$ is user-specific since the adjacency matrix $\mathbf{A}_u$ is user-specific. Furthermore, note that the system is end-to-end trainable where the gradients flow from $f(\cdot)$ via the GNN (parameter matrix $\mathbf{W}$) to $g(\cdot)$ and eventually to representations of users $\mathbf{u}$ and items $\mathbf{v}$.
It is worth noticing a significant difference between our model and GNNs: In traditional GNNs, edge weights of the input graph are fixed; but in our model, edge weights $\mathbf{D}_u^{-1/2} \mathbf{A}_u \mathbf{D}_u^{-1/2}$ in Eq. (1) are learnable (including possible parameters of function $g$ and feature vectors of users and relations) and also require supervised training like $\mathbf{W}$. Though this enhances the fitting ability of the model, it inevitably makes the optimization process prone to overfitting, since the only source of supervised signal is from user-item interactions outside the GNN layers. Moreover, edge weights do play an essential role in representation learning on graphs, as highlighted by a large amount of prior work (Zhu et al., 2003; Zhang and Lee, 2007; Wang and Zhang, 2008; Karasuyama and Mamitsuka, 2013; Velickovic et al., 2018). Therefore, more regularization on edge weights is needed to assist the learning of entity representations and to help generalize to unobserved interactions more efficiently.
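The construction described above (score each relation for a user, build a symmetric user-specific adjacency matrix with self-connections, then apply one normalized propagation step) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' TensorFlow implementation: the function names, the toy triples, and the feature values are all hypothetical, the scoring function is taken to be the inner product, and the relation scores are assumed positive so that every degree is positive.

```python
import math

def relation_scores(user_vec, relation_vecs):
    """Score every relation for one user: s_u(r) = g(u, r), with g = inner product."""
    return {rel: sum(a * b for a, b in zip(user_vec, vec))
            for rel, vec in relation_vecs.items()}

def user_adjacency(num_entities, triples, scores):
    """Symmetric user-specific adjacency matrix with self-connections (A_u + I)."""
    A = [[0.0] * num_entities for _ in range(num_entities)]
    for head, rel, tail in triples:
        A[head][tail] = A[tail][head] = scores[rel]
    for i in range(num_entities):
        A[i][i] += 1.0
    return A

def matmul(X, Y):
    """Dense matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def gnn_layer(A, H, W):
    """One propagation step: relu(D^{-1/2} A D^{-1/2} H W)."""
    n = len(A)
    deg = [sum(row) for row in A]  # assumed positive (positive relation scores)
    A_norm = [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    return [[max(0.0, x) for x in row] for row in matmul(matmul(A_norm, H), W)]

# Toy data (hypothetical): entities 0 and 1 are movies, entity 2 is a person.
user_vec = [1.0, 0.5]
relation_vecs = {"director": [1.0, 0.0], "star": [0.0, 1.0]}
triples = [(0, "director", 2), (1, "star", 2)]

scores = relation_scores(user_vec, relation_vecs)
A_u = user_adjacency(3, triples, scores)
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # raw entity features
W0 = [[1.0, 0.0], [0.0, 1.0]]              # identity weights, for readability
H1 = gnn_layer(A_u, H0, W0)                # entity representations after one layer
```

Because this user weights "director" more heavily than "star", movie 0 receives more of the person's features than movie 1 does. A different user vector would yield a different `A_u` and therefore different item representations from the same KG.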
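Let us consider what an ideal set of edge weights should look like. Consider a real-valued label function $l_u: \mathcal{E} \rightarrow \mathbb{R}$ on $\mathcal{G}$, which is constrained to take a specific value $l_u(v) = y_{uv}$ at node $v \in \mathcal{V} \subseteq \mathcal{E}$. In our context, $l_u(v) = 1$ if user $u$ finds the item $v$ relevant and has engaged with it, otherwise $l_u(v) = 0$. Intuitively, we hope that adjacent entities in the KG are likely to have similar relevancy labels, which is known as the label smoothness assumption. This motivates our choice of energy function $E$:
$$E(l_u, \mathbf{A}_u) = \frac{1}{2} \sum_{e_i \in \mathcal{E},\, e_j \in \mathcal{E}} A_u^{ij} \big(l_u(e_i) - l_u(e_j)\big)^2. \tag{2}$$
We show that the minimum-energy label function is harmonic by the following theorem:
Theorem 1. The minimum-energy label function $l_u^* = \arg\min_{l_u:\, l_u(v) = y_{uv},\, \forall v \in \mathcal{V}} E(l_u, \mathbf{A}_u)$ w.r.t. Eq. (2) is harmonic, i.e., $l_u^*$ satisfies
$$l_u^*(e_i) = \frac{1}{D_u^{ii}} \sum_{e_j \in \mathcal{E}} A_u^{ij}\, l_u^*(e_j), \quad \forall e_i \in \mathcal{E} \setminus \mathcal{V}.$$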
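Proof. Taking the derivative of the following equation
$$E(l_u, \mathbf{A}_u) = \frac{1}{2} \sum_{i, j} A_u^{ij} \big(l_u(e_i) - l_u(e_j)\big)^2$$
with respect to $l_u(e_i)$ where $e_i \in \mathcal{E} \setminus \mathcal{V}$, and using the symmetry of $\mathbf{A}_u$ (up to a constant positive factor), we have
$$\frac{\partial E(l_u, \mathbf{A}_u)}{\partial l_u(e_i)} = \sum_{j} A_u^{ij} \big(l_u(e_i) - l_u(e_j)\big).$$
The minimum-energy label function $l_u^*$ should satisfy that
$$\left.\frac{\partial E(l_u, \mathbf{A}_u)}{\partial l_u(e_i)}\right|_{l_u = l_u^*} = 0.$$
Therefore, we have
$$l_u^*(e_i) = \frac{1}{\sum_j A_u^{ij}} \sum_j A_u^{ij}\, l_u^*(e_j) = \frac{1}{D_u^{ii}} \sum_j A_u^{ij}\, l_u^*(e_j), \quad \forall e_i \in \mathcal{E} \setminus \mathcal{V}.$$
∎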
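The harmonic property indicates that the value of $l_u^*$ at each non-item entity $e_i \in \mathcal{E} \setminus \mathcal{V}$ is the average of its neighboring entities, which leads to the following label propagation scheme (Zhu et al., 2005):
Theorem 2. Repeating the following two steps:
(1) Propagate labels for all entities: $l_u(\mathcal{E}) \leftarrow \mathbf{D}_u^{-1} \mathbf{A}_u\, l_u(\mathcal{E})$, where $l_u(\mathcal{E})$ is the vector of labels for all entities;
(2) Reset labels of all items to initial labels: $l_u(\mathcal{V}) \leftarrow \mathbf{Y}[u, \mathcal{V}]^\top$, where $l_u(\mathcal{V})$ is the vector of labels for all items and $\mathbf{Y}[u, \mathcal{V}] = [y_{uv_1}, y_{uv_2}, \cdots]$ are initial labels;
will lead to $l_u \rightarrow l_u^*$.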
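Proof. Let $l_u(\mathcal{E}) = \begin{bmatrix} l_u(\mathcal{V}) \\ l_u(\mathcal{E} \setminus \mathcal{V}) \end{bmatrix}$. Since $l_u(\mathcal{V})$ is fixed on $\mathbf{Y}[u, \mathcal{V}]$, we are only interested in $l_u(\mathcal{E} \setminus \mathcal{V})$. We denote $\mathbf{P} = \mathbf{D}_u^{-1} \mathbf{A}_u$ (the subscript $u$ is omitted from $\mathbf{P}$ for ease of notation), and partition matrix $\mathbf{P}$ into sub-matrices according to the partition of $l_u$:
$$\mathbf{P} = \begin{bmatrix} \mathbf{P}_{\mathcal{V}\mathcal{V}} & \mathbf{P}_{\mathcal{V}\mathcal{E}} \\ \mathbf{P}_{\mathcal{E}\mathcal{V}} & \mathbf{P}_{\mathcal{E}\mathcal{E}} \end{bmatrix}.$$
Then the label propagation scheme is equivalent to
$$l_u(\mathcal{E} \setminus \mathcal{V}) \leftarrow \mathbf{P}_{\mathcal{E}\mathcal{V}}\, \mathbf{Y}[u, \mathcal{V}]^\top + \mathbf{P}_{\mathcal{E}\mathcal{E}}\, l_u(\mathcal{E} \setminus \mathcal{V}). \tag{5}$$
Repeating the above procedure, we have
$$l_u(\mathcal{E} \setminus \mathcal{V}) = \lim_{n \rightarrow \infty} \left[ (\mathbf{P}_{\mathcal{E}\mathcal{E}})^n\, l_u^{(0)}(\mathcal{E} \setminus \mathcal{V}) + \left( \sum_{i=1}^{n} (\mathbf{P}_{\mathcal{E}\mathcal{E}})^{i-1} \right) \mathbf{P}_{\mathcal{E}\mathcal{V}}\, \mathbf{Y}[u, \mathcal{V}]^\top \right], \tag{6}$$
where $l_u^{(0)}(\mathcal{E} \setminus \mathcal{V})$ is the initial value for $l_u(\mathcal{E} \setminus \mathcal{V})$. Now we show that $\lim_{n \rightarrow \infty} (\mathbf{P}_{\mathcal{E}\mathcal{E}})^n\, l_u^{(0)}(\mathcal{E} \setminus \mathcal{V}) = \mathbf{0}$. Since $\mathbf{P}$ is row-normalized and $\mathbf{P}_{\mathcal{E}\mathcal{E}}$ is a sub-matrix of $\mathbf{P}$, we have
$$\exists\, \epsilon < 1, \quad \sum_j \mathbf{P}_{\mathcal{E}\mathcal{E}}[i, j] \leq \epsilon$$
for all possible row indices $i$. Therefore,
$$\sum_j (\mathbf{P}_{\mathcal{E}\mathcal{E}})^n[i, j] = \sum_j \sum_k (\mathbf{P}_{\mathcal{E}\mathcal{E}})^{n-1}[i, k]\, \mathbf{P}_{\mathcal{E}\mathcal{E}}[k, j] = \sum_k (\mathbf{P}_{\mathcal{E}\mathcal{E}})^{n-1}[i, k] \sum_j \mathbf{P}_{\mathcal{E}\mathcal{E}}[k, j] \leq \epsilon \sum_k (\mathbf{P}_{\mathcal{E}\mathcal{E}})^{n-1}[i, k] \leq \cdots \leq \epsilon^n.$$
As $n$ goes to infinity, the row sum of $(\mathbf{P}_{\mathcal{E}\mathcal{E}})^n$ converges to zero, which implies that $\lim_{n \rightarrow \infty} (\mathbf{P}_{\mathcal{E}\mathcal{E}})^n\, l_u^{(0)}(\mathcal{E} \setminus \mathcal{V}) = \mathbf{0}$. It is clear that the choice of initial value $l_u^{(0)}(\mathcal{E} \setminus \mathcal{V})$ does not affect the convergence.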
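The two-step propagate-and-reset scheme can be checked numerically on a small graph. The sketch below is an illustration only: the function name `propagate_labels`, the fixed iteration count, and the four-node path graph are assumptions made for this example. On the path item 0, entity 2, entity 3, item 1 with unit weights, the harmonic condition gives labels 2/3 and 1/3 at the two non-item entities, and the iteration should approach those values.

```python
def propagate_labels(A, item_labels, num_iters=200):
    """Label propagation with clamped item labels.

    A           : adjacency matrix (list of lists) with positive row sums
    item_labels : dict mapping item index -> fixed relevancy label
    Returns the label of every entity after num_iters iterations.
    """
    n = len(A)
    deg = [sum(row) for row in A]
    labels = [item_labels.get(i, 0.0) for i in range(n)]  # unlabeled entities start at 0
    for _ in range(num_iters):
        # Step 1: propagate, i.e. each entity takes the weighted average of its neighbors.
        labels = [sum(A[i][j] * labels[j] for j in range(n)) / deg[i]
                  for i in range(n)]
        # Step 2: reset the labels of all items to their initial values.
        for i, y in item_labels.items():
            labels[i] = y
    return labels

# Path graph 0 - 2 - 3 - 1 with unit edge weights (hypothetical toy example).
# Entities 0 and 1 are items (labels 1 and 0); entities 2 and 3 are non-items.
A = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]
labels = propagate_labels(A, {0: 1.0, 1: 0.0})
```

Starting the non-item entities from a different initial value changes only the transient, which matches the remark that the initial value does not affect convergence.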
Theorem 2 provides a way of reaching the minimum of the energy function $E$ over relevancy label functions. However, $l_u^*$ does not provide any signal for updating the edge weight matrix $\mathbf{A}_u$, since the labeled part of $l_u^*$, i.e., $l_u^*(\mathcal{V})$, equals the true relevancy labels $\mathbf{Y}[u, \mathcal{V}]$; moreover, we do not know the true relevancy labels for the unlabeled nodes $l_u^*(\mathcal{E} \setminus \mathcal{V})$.
To solve the issue, we propose minimizing the leave-one-out loss (Zhang and Lee, 2007). Suppose we hold out a single item $v$ and treat it as unlabeled. Then we predict its label by using the rest of the (labeled) items and the (unlabeled) non-item entities. The prediction process is identical to label propagation in Theorem 2, except that the label of item $v$ is hidden and needs to be calculated. This way, the difference between the true relevancy label of $v$ (i.e., $y_{uv}$) and the predicted label $\hat{l}_u(v)$ serves as a supervised signal for regularizing edge weights:
$$R(\mathbf{A}) = \sum_u R(\mathbf{A}_u) = \sum_u \sum_v J\big(y_{uv}, \hat{l}_u(v)\big), \tag{7}$$
where $J$ is the cross-entropy loss function. Given the regularization in Eq. (7), an ideal edge weight matrix $\mathbf{A}$ should reproduce the true relevancy label of each held-out item while also satisfying the smoothness of relevancy labels.
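To make the leave-one-out idea concrete, the sketch below hides one item at a time, predicts its label by label propagation from the remaining items, and sums the cross-entropy between true and predicted labels for a single user. Everything here is a hypothetical illustration rather than the authors' implementation: the helper names, the clamping constant, and the toy graph (four movies, two "director" entities, one shared "genre" entity) are assumptions. In this toy graph the positive items share a director, so a user-specific weighting that emphasizes "director" edges should give a smaller loss than one that emphasizes the shared genre.

```python
import math

def propagate(A, fixed, num_iters=200):
    """Label propagation in which only the entries of `fixed` are clamped."""
    n = len(A)
    deg = [sum(row) for row in A]
    labels = [fixed.get(i, 0.0) for i in range(n)]
    for _ in range(num_iters):
        labels = [sum(A[i][j] * labels[j] for j in range(n)) / deg[i]
                  for i in range(n)]
        for i, y in fixed.items():
            labels[i] = y
    return labels

def held_out_prediction(A, item_labels, v):
    """Predict the label of item v from all other labeled items."""
    rest = {i: y for i, y in item_labels.items() if i != v}
    return propagate(A, rest)[v]

def leave_one_out_loss(A, item_labels, eps=1e-7):
    """Sum of cross-entropy losses over held-out items (one user)."""
    loss = 0.0
    for v, y in item_labels.items():
        p = min(max(held_out_prediction(A, item_labels, v), eps), 1.0 - eps)
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss

def toy_graph(w_director, w_genre):
    """Items 0-3; entity 4 directs items 0,1; entity 6 directs items 2,3;
    entity 5 is a genre shared by all four items (hypothetical data)."""
    edges = [(0, 4, w_director), (1, 4, w_director),
             (2, 6, w_director), (3, 6, w_director),
             (0, 5, w_genre), (1, 5, w_genre), (2, 5, w_genre), (3, 5, w_genre)]
    A = [[0.0] * 7 for _ in range(7)]
    for i, j, w in edges:
        A[i][j] = A[j][i] = w
    return A

item_labels = {0: 1.0, 1: 1.0, 2: 0.0, 3: 0.0}  # the user engaged with items 0 and 1
loss_director = leave_one_out_loss(toy_graph(5.0, 1.0), item_labels)
loss_genre = leave_one_out_loss(toy_graph(1.0, 5.0), item_labels)
```

Minimizing this quantity with respect to the relation scores would push weight toward the relation that actually explains the user's labels, which is the regularizing effect described above.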
Combining knowledge-aware graph neural networks and LS regularization, we reach the following complete loss function:
$$\min_{\mathbf{W}, \mathbf{A}} \mathcal{L} = \min_{\mathbf{W}, \mathbf{A}} \sum_{u, v} J(y_{uv}, \hat{y}_{uv}) + \lambda R(\mathbf{A}) + \gamma \|\mathcal{F}\|_2^2, \tag{8}$$
where $\|\mathcal{F}\|_2^2$ is the L2-regularizer, and $\lambda$ and $\gamma$ are balancing hyper-parameters. In Eq. (8), the first term corresponds to the part of the GNN that learns the transformation matrix $\mathbf{W}$ and edge weights $\mathbf{A}$ simultaneously, while the second term $R(\cdot)$ corresponds to the part of label smoothness that can be seen as adding a constraint on edge weights $\mathbf{A}$. Therefore, $R(\cdot)$ serves as regularization on $\mathbf{A}$ to assist the GNN in learning edge weights.
It is also worth noticing that the first term can be seen as feature propagation on the KG while the second term can be seen as label propagation on the KG. A recommender for a specific user $u$ is actually a mapping from item features to user-item interaction labels, i.e., $\mathcal{F}_u: \mathbf{E}_v \rightarrow y_{uv}$, where $\mathbf{E}_v$ is the feature vector of item $v$. Therefore, Eq. (8) utilizes the structural information of the KG on both the feature side and the label side of $\mathcal{F}_u$ to capture users’ higher-order preferences.
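As a rough sketch of how the three terms of the complete loss combine, the function below adds a cross-entropy term on the GNN predictions, a weighted cross-entropy term on the leave-one-out label-propagation predictions, and a weighted squared L2 norm of the parameters. The names `total_loss`, `lam`, and `gamma`, and the flat list of parameters, are assumptions for illustration; a real implementation would compute these terms on tensors inside a training framework.

```python
import math

def cross_entropy(y, p, eps=1e-7):
    """Binary cross-entropy for one (label, predicted probability) pair."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def total_loss(y_true, gnn_pred, ls_pred, params, lam, gamma):
    """Base loss + lam * label-smoothness term + gamma * L2 penalty.

    y_true   : observed interaction labels
    gnn_pred : probabilities predicted by the GNN branch
    ls_pred  : leave-one-out labels predicted by label propagation
    params   : flat list of trainable parameter values (hypothetical layout)
    """
    base = sum(cross_entropy(y, p) for y, p in zip(y_true, gnn_pred))
    smooth = sum(cross_entropy(y, p) for y, p in zip(y_true, ls_pred))
    l2 = sum(w * w for w in params)
    return base + lam * smooth + gamma * l2

# Hypothetical numbers for two user-item pairs.
y_true = [1.0, 0.0]
gnn_pred = [0.8, 0.3]
ls_pred = [0.6, 0.4]
params = [0.5, -0.5]

loss_no_ls = total_loss(y_true, gnn_pred, ls_pred, params, lam=0.0, gamma=0.0)
loss_with_ls = total_loss(y_true, gnn_pred, ls_pred, params, lam=1.0, gamma=0.1)
```

Setting `lam` to zero removes the label-smoothness term, so the edge weights are then trained only through the GNN predictions.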
How can the knowledge graph help find users’ interests? To intuitively understand the role of the KG, we make an analogy with a physical equilibrium model as shown in Figure 2. Each entity/item is seen as a particle, while the supervised positive user-relevancy signal acts as the force pulling the observed positive items up from the decision boundary and the negative signal acts as the force pushing the unobserved items down. Without the KG (Figure 2(a)), these items are only loosely connected with each other through the collaborative filtering effect (which is not drawn here for clarity). In contrast, edges in the KG serve as rubber bands that impose explicit constraints on connected entities. When the number of layers is $L = 1$ (Figure 2(b)), the representation of each entity is a mixture of itself and its immediate neighbors; therefore, optimizing on the positive items will simultaneously pull their immediate neighbors up together. The upward force goes deeper in the KG with the increase of $L$ (Figure 2(c)), which helps explore users’ long-distance interests and pull up more positive items. It is also interesting to note that the proximity constraint exerted by the KG is personalized since the strength of the rubber band (i.e., $s_u(r)$) is user-specific and relation-specific: One user may prefer relation $r_1$ (Figure 2(b)) while another user (with the same observed items but different unobserved items) may prefer relation $r_2$ (Figure 2(d)).
Despite the force exerted by edges in the KG, edge weights may be set inappropriately, for example, too small to pull up the unobserved items (i.e., rubber bands are too weak). Next, we show in Figure 2(e) how the label smoothness assumption helps regularize the learning of edge weights. Suppose we hold out the positive sample in the upper left and we intend to reproduce its label from the rest of the items. Since the true relevancy label of the held-out sample is 1 and the upper right sample has the largest label value, the LS regularization term would enforce the edges with arrows to be large so that the label can “flow” from the blue one to the striped one as much as possible. As a result, this will tighten the rubber bands (denoted by arrows) and encourage the model to pull up the two upper pink items to a greater extent.
Statistic | Movie | Book | Music | Restaurant |
---|---|---|---|---|
# users | 138,159 | 19,676 | 1,872 | 2,298,698 |
# items | 16,954 | 20,003 | 3,846 | 1,362 |
# interactions | 13,501,622 | 172,576 | 42,346 | 23,416,418 |
# entities | 102,569 | 25,787 | 9,366 | 28,115 |
# relations | 32 | 18 | 60 | 7 |
# KG triples | 499,474 | 60,787 | 15,518 | 160,519 |
In this section, we evaluate the proposed KGNN-LS model, and present its performance on four real-world scenarios: movie, book, music, and restaurant recommendations.
We utilize the following four datasets in our experiments for movie, book, music, and restaurant recommendations, respectively, in which the first three are public datasets and the last one is from Meituan-Dianping Group. We use Satori (https://searchengineland.com/library/bing/bing-satori), a commercial KG built by Microsoft, to construct sub-KGs for the MovieLens-20M, Book-Crossing, and Last.FM datasets. The KG for the Dianping-Food dataset is constructed by the internal toolkit of Meituan-Dianping Group. Further details of the datasets are provided in Appendix A.
MovieLens-20M (https://grouplens.org/datasets/movielens/) is a widely used benchmark dataset in movie recommendations, which consists of approximately 20 million explicit ratings (ranging from 1 to 5) on the MovieLens website. The corresponding KG contains 102,569 entities, 499,474 edges, and 32 relation types.
Book-Crossing (http://www2.informatik.uni-freiburg.de/~cziegler/BX/) contains 1 million ratings (ranging from 0 to 10) of books in the Book-Crossing community. The corresponding KG contains 25,787 entities, 60,787 edges, and 18 relation types.
Last.FM (https://grouplens.org/datasets/hetrec-2011/) contains musician listening information from a set of 2 thousand users of the Last.fm online music system. The corresponding KG contains 9,366 entities, 15,518 edges, and 60 relation types.
Dianping-Food is provided by Dianping.com (https://www.dianping.com/) and contains over 10 million interactions (including clicking, buying, and adding to favorites) between approximately 2 million users and 1 thousand restaurants. The corresponding KG contains 28,115 entities, 160,519 edges, and 7 relation types.
The statistics of the four datasets are shown in Table 2.
Model | MovieLens-20M R@2 | MovieLens-20M R@10 | MovieLens-20M R@50 | MovieLens-20M R@100 | Book-Crossing R@2 | Book-Crossing R@10 | Book-Crossing R@50 | Book-Crossing R@100 | Last.FM R@2 | Last.FM R@10 | Last.FM R@50 | Last.FM R@100 | Dianping-Food R@2 | Dianping-Food R@10 | Dianping-Food R@50 | Dianping-Food R@100 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SVD | 0.036 | 0.124 | 0.277 | 0.401 | 0.027 | 0.046 | 0.077 | 0.109 | 0.029 | 0.098 | 0.240 | 0.332 | 0.039 | 0.152 | 0.329 | 0.451 |
LibFM | 0.039 | 0.121 | 0.271 | 0.388 | 0.033 | 0.062 | 0.092 | 0.124 | 0.030 | 0.103 | 0.263 | 0.330 | 0.043 | 0.156 | 0.332 | 0.448 |
LibFM + TransE | 0.041 | 0.125 | 0.280 | 0.396 | 0.037 | 0.064 | 0.097 | 0.130 | 0.032 | 0.102 | 0.259 | 0.326 | 0.044 | 0.161 | 0.343 | 0.455 |
PER | 0.022 | 0.077 | 0.160 | 0.243 | 0.022 | 0.041 | 0.064 | 0.070 | 0.014 | 0.052 | 0.116 | 0.176 | 0.023 | 0.102 | 0.256 | 0.354 |
CKE | 0.034 | 0.107 | 0.244 | 0.322 | 0.028 | 0.051 | 0.079 | 0.112 | 0.023 | 0.070 | 0.180 | 0.296 | 0.034 | 0.138 | 0.305 | 0.437 |
RippleNet | 0.045 | 0.130 | 0.278 | 0.447 | 0.036 | 0.074 | 0.107 | 0.127 | 0.032 | 0.101 | 0.242 | 0.336 | 0.040 | 0.155 | 0.328 | 0.440 |
KGNN-LS | 0.043 | 0.155 | 0.321 | 0.458 | 0.045 | 0.082 | 0.117 | 0.149 | 0.044 | 0.122 | 0.277 | 0.370 | 0.047 | 0.170 | 0.340 | 0.487 |
Model | Movie | Book | Music | Restaurant |
---|---|---|---|---|
SVD | 0.963 | 0.672 | 0.769 | 0.838 |
LibFM | 0.959 | 0.691 | 0.778 | 0.837 |
LibFM + TransE | 0.966 | 0.698 | 0.777 | 0.839 |
PER | 0.832 | 0.617 | 0.633 | 0.746 |
CKE | 0.924 | 0.677 | 0.744 | 0.802 |
RippleNet | 0.960 | 0.727 | 0.770 | 0.833 |
KGNN-LS | 0.979 | 0.744 | 0.803 | 0.850 |
We compare the proposed KGNN-LS model with the following baselines for recommender systems, in which the first two baselines are KG-free while the rest are all KG-aware methods. The hyper-parameter setting of KGNN-LS is provided in Appendix B.
SVD (Koren, 2008) is a classic CF-based model using inner product to model user-item interactions. We use the unbiased version (i.e., the predicted engaging probability is modeled as ). The dimension and learning rate for the four datasets are set as: , for MovieLens-20M, Book-Crossing; , for Last.FM; , for Dianping-Food.
LibFM + TransE extends LibFM by attaching an entity representation learned by TransE (Bordes et al., 2013) to each user-item pair. The dimension of TransE is for all datasets.
PER (Yu et al., 2014) is a representative of path-based methods, which treats the KG as heterogeneous information networks and extracts meta-path based features to represent the connectivity between users and items. We use manually designed “user-item-attribute-item” as meta-paths, i.e., “user-movie-director-movie”, “user-movie-genre-movie”, and “user-movie-star-movie” for MovieLens-20M; “user-book-author-book” and “user-book-genre-book” for Book-Crossing, “user-musician-date_of_birth-musician” (date of birth is discretized), “user-musician-country-musician”, and “user-musician-genre-musician” for Last.FM; “user-restaurant-dish-restaurant”, “user-restaurant-business_area-restaurant”, “user-restaurant-tag-restaurant” for Dianping-Food. The settings of dimension and learning rate are the same as SVD.
CKE (Zhang et al., 2016) is a representative of embedding-based methods, which combines CF with structural, textual, and visual knowledge in a unified framework. We implement CKE as CF plus a structural knowledge module in this paper. The dimension of embedding for the four datasets are , , , . The training weight for KG part is for all datasets. The learning rate are the same as in SVD.
RippleNet (Wang et al., 2018c) is a representative of hybrid methods, which is a memory-network-like approach that propagates users’ preferences on the KG for recommendation. For RippleNet, , , , , for MovieLens-20M; , , , , for Last.FM; , , , , for Dianping-Food.
To validate the connection between the knowledge graph $\mathcal{G}$ and the user-item interaction matrix $\mathbf{Y}$, we conduct an empirical study in which we investigate the correlation between the shortest path distance of two randomly sampled items in the KG and whether they have common user(s) in the dataset, that is, whether there exist user(s) that interacted with both items. For MovieLens-20M and Last.FM, we randomly sample ten thousand item pairs that have no common users and ten thousand item pairs that have at least one common user, then compute the distribution of their shortest path distances in the KG. The results are presented in Figure 3, which clearly shows that if two items have common user(s) in the dataset, they are likely to be closer in the KG. For example, if two movies have common user(s) in MovieLens-20M, there is a probability of that they will be within 2 hops in the KG, while the probability is if they have no common user. This finding empirically demonstrates that exploiting the proximity structure of the KG can assist in making recommendations. This also justifies our motivation to use label smoothness regularization to help learn entity representations.
We evaluate our method in two experiment scenarios: (1) In top-$K$ recommendation, we use the trained model to select the $K$ items with the highest predicted click probability for each user in the test set, and choose Recall@$K$ to evaluate the recommended sets. (2) In click-through rate (CTR) prediction, we apply the trained model to predict each user-item pair in the test set (including positive items and randomly selected negative items). We use AUC as the evaluation metric in CTR prediction.
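For reference, the two kinds of metric used in these scenarios can be computed as in the sketch below. It assumes the standard definitions of Recall@K (fraction of a user's positive test items that appear among the K highest-scored items) and AUC (probability that a randomly chosen positive pair is scored above a randomly chosen negative pair, with ties counted as one half). The function names and the toy scores are hypothetical.

```python
def recall_at_k(scores, positives, k):
    """Recall@K for one user.

    scores    : dict mapping item -> predicted click probability
    positives : set of items the user actually engaged with in the test set
    """
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(set(top_k) & positives) / len(positives)

def auc(labels, scores):
    """Pairwise AUC over positive/negative pairs; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions for one user over four items.
item_scores = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}
engaged = {"a", "c"}

r_at_2 = recall_at_k(item_scores, engaged, k=2)       # only "a" is in the top 2
auc_value = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])   # 3 of 4 pairs ranked correctly
```

The pairwise AUC loop is quadratic in the number of test pairs, which is fine for a small example; large test sets would typically use a rank-based formulation instead.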
The results of top-$K$ recommendation and CTR prediction are presented in Tables 3 and 4, respectively, which show that KGNN-LS outperforms baselines by a significant margin. For example, the of KGNN-LS surpasses baselines by , , , and on average in MovieLens-20M, Book-Crossing, Last.FM, and Dianping-Food datasets, respectively.
We also show the daily performance of KGNN-LS and baselines on Dianping-Food to investigate performance stability. Figure 6 shows their score from September 1, 2018 to September 30, 2018. We notice that the curve of KGNN-LS is consistently above the baselines over the test period; moreover, the performance of KGNN-LS has low variance, which suggests that KGNN-LS is also robust and stable in practice.
Is the proposed LS regularization helpful in improving the performance of GNN? To study the effectiveness of LS regularization, we fix the dimension of hidden layers at three different values, then vary the LS regularization weight $\lambda$ to see how performance changes. The results on the Last.FM dataset are plotted in Figure 6. It is clear that KGNN-LS performs better with a non-zero $\lambda$ than with $\lambda = 0$ (the case of Wang et al. (2019c)), which justifies our claim that LS regularization can assist learning the edge weights in a KG and achieve better generalization in recommender systems. Note, however, that too large a $\lambda$ is less favorable, since the regularization term then overwhelms the overall loss and misleads the direction of gradients. According to the experimental results, a moderate $\lambda$ is preferable in most cases.
Model | 20% | 40% | 60% | 80% | 100% |
---|---|---|---|---|---|
SVD | 0.882 | 0.913 | 0.938 | 0.955 | 0.963 |
LibFM | 0.902 | 0.923 | 0.938 | 0.950 | 0.959 |
LibFM+TransE | 0.914 | 0.935 | 0.949 | 0.960 | 0.966 |
PER | 0.802 | 0.814 | 0.821 | 0.828 | 0.832 |
CKE | 0.898 | 0.910 | 0.916 | 0.921 | 0.924 |
RippleNet | 0.921 | 0.937 | 0.947 | 0.955 | 0.960 |
KGNN-LS | 0.961 | 0.970 | 0.974 | 0.977 | 0.979 |
One major goal of using KGs in recommender systems is to alleviate the sparsity issue. To investigate the performance of KGNN-LS in cold-start scenarios, we vary the ratio $r$ of the MovieLens-20M training set from $20\%$ to $100\%$ (while the validation and test sets are kept fixed), and report the AUC results in Table 5. When $r = 20\%$, AUC decreases by $8.4\%$, $5.9\%$, $5.4\%$, $3.6\%$, $2.8\%$, and $4.1\%$ for the six baselines compared to the model trained on full training data ($r = 100\%$), but the performance decrease of KGNN-LS is only $1.8\%$. This demonstrates that KGNN-LS still maintains predictive performance even when user-item interactions are sparse.
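The performance drops discussed above can be recomputed directly from the table. The sketch below takes the first column as the smallest training set and the last column as the full training set, and computes the relative decrease of each method; the variable names are illustrative only.

```python
# Rows copied from the table above; columns go from the smallest
# training set (first) to the full training set (last).
scores = {
    "SVD":          [0.882, 0.913, 0.938, 0.955, 0.963],
    "LibFM":        [0.902, 0.923, 0.938, 0.950, 0.959],
    "LibFM+TransE": [0.914, 0.935, 0.949, 0.960, 0.966],
    "PER":          [0.802, 0.814, 0.821, 0.828, 0.832],
    "CKE":          [0.898, 0.910, 0.916, 0.921, 0.924],
    "RippleNet":    [0.921, 0.937, 0.947, 0.955, 0.960],
    "KGNN-LS":      [0.961, 0.970, 0.974, 0.977, 0.979],
}

def relative_decrease(row):
    """Percentage drop from the full-data score to the sparsest-data score."""
    sparse, full = row[0], row[-1]
    return 100.0 * (full - sparse) / full

drops = {model: round(relative_decrease(row), 1) for model, row in scores.items()}
most_robust = min(drops, key=drops.get)

for model, drop in drops.items():
    print(f"{model:13s} {drop:4.1f}%")
```

Under this definition the smallest relative drop belongs to KGNN-LS, in line with the robustness claim in the text.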
We first analyze the sensitivity of KGNN-LS to the number of GNN layers $L$. We vary $L$ from 1 to 4 while keeping other hyper-parameters fixed. The results are shown in Table 6. We find that the model performs poorly when $L = 4$, because a larger $L$ mixes too many entity embeddings into a given entity, which over-smooths the representation learning on KGs. KGNN-LS achieves the best performance when $L = 1$ or $2$ in the four datasets.
$L$ | 1 | 2 | 3 | 4 |
---|---|---|---|---|
MovieLens-20M | 0.155 | 0.146 | 0.122 | 0.011 |
Book-Crossing | 0.077 | 0.082 | 0.043 | 0.008 |
Last.FM | 0.122 | 0.106 | 0.105 | 0.057 |
Dianping-Food | 0.165 | 0.170 | 0.061 | 0.036 |
We also examine the impact of the dimension of hidden layers $d$ on the performance of KGNN-LS. The results are shown in Table 7. We observe that the performance is boosted with the increase of $d$ at the beginning, because more bits in hidden layers can improve the model capacity. However, the performance drops when $d$ further increases, since too large a dimension may overfit the datasets. The best performance is achieved when $d$ is between 8 and 64.
$d$ | 4 | 8 | 16 | 32 | 64 | 128 |
---|---|---|---|---|---|---|
MovieLens-20M | 0.134 | 0.141 | 0.143 | 0.155 | 0.155 | 0.151 |
Book-Crossing | 0.065 | 0.073 | 0.077 | 0.081 | 0.082 | 0.080 |
Last.FM | 0.111 | 0.116 | 0.122 | 0.109 | 0.102 | 0.107 |
Dianping-Food | 0.155 | 0.170 | 0.167 | 0.166 | 0.163 | 0.161 |
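The best setting per dataset can be read off the two sensitivity tables above with a few lines of code. The sketch below is only a convenience for checking the tables; `best_setting` is a hypothetical helper, and ties are resolved in favor of the smaller setting.

```python
def best_setting(settings, table):
    """Return, for each dataset, the setting with the highest score.
    Ties go to the first (smallest) setting."""
    best = {}
    for dataset, row in table.items():
        best[dataset] = settings[row.index(max(row))]
    return best

# Number of GNN layers (values copied from the first table).
layer_settings = [1, 2, 3, 4]
layer_table = {
    "MovieLens-20M": [0.155, 0.146, 0.122, 0.011],
    "Book-Crossing": [0.077, 0.082, 0.043, 0.008],
    "Last.FM":       [0.122, 0.106, 0.105, 0.057],
    "Dianping-Food": [0.165, 0.170, 0.061, 0.036],
}

# Hidden dimension (values copied from the second table).
dim_settings = [4, 8, 16, 32, 64, 128]
dim_table = {
    "MovieLens-20M": [0.134, 0.141, 0.143, 0.155, 0.155, 0.151],
    "Book-Crossing": [0.065, 0.073, 0.077, 0.081, 0.082, 0.080],
    "Last.FM":       [0.111, 0.116, 0.122, 0.109, 0.102, 0.107],
    "Dianping-Food": [0.155, 0.170, 0.167, 0.166, 0.163, 0.161],
}

best_layers = best_setting(layer_settings, layer_table)
best_dims = best_setting(dim_settings, dim_table)
```

Every dataset peaks at one or two layers, and the best hidden dimension ranges from 8 to 64 depending on the dataset.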
We also investigate the running time of our method with respect to the size of the KG. We run experiments on a Microsoft Azure virtual machine with 1 NVIDIA Tesla M60 GPU, 12 Intel Xeon CPUs (E5-2690 v3 @2.60GHz), and 128GB of RAM. The size of the KG is increased by up to five times the original one by extracting more triples from Satori, and the running times of all methods on MovieLens-20M are reported in Figure 6. Note that the trend of a curve matters more than the real values, since the values are largely dependent on the minibatch size and the number of epochs (yet we did try to align the configurations of all methods). The results show that KGNN-LS exhibits strong scalability even when the KG is large.
In this paper, we propose knowledge-aware graph neural networks with label smoothness regularization for recommendation. KGNN-LS applies the GNN architecture to KGs by using user-specific relation scoring functions and aggregating neighborhood information with different weights. In addition, the proposed label smoothness constraint and leave-one-out loss provide strong regularization for learning the edge weights in KGs. We also discuss how KGs benefit recommender systems and how label smoothness can assist learning the edge weights. Experimental results show that KGNN-LS outperforms state-of-the-art baselines in four recommendation scenarios and achieves desirable scalability with respect to KG size.
In this paper, LS regularization is proposed for the recommendation task with KGs. It would be interesting to examine the LS assumption on other graph tasks such as link prediction and node classification. Investigating the theoretical relationship between feature propagation and label propagation is also a promising direction.
Acknowledgements. This research has been supported in part by NSF OAC-1835598, DARPA MCS, ARO MURI, Boeing, Docomo, Hitachi, Huawei, JD, Siemens, and Stanford Data Science Initiative.
The MovieLens-20M, Book-Crossing, and Last.FM datasets contain explicit feedback data (Last.FM provides the listening count as the weight for each user-item interaction). Therefore, we transform them into implicit feedback, where each entry is marked with 1 indicating that the user has rated the item positively. The threshold of positive rating is 4 for MovieLens-20M, while no threshold is set for Book-Crossing and Last.FM due to their sparsity. Additionally, for each user we randomly sample a set of unwatched items and mark them as 0, the number of which equals the number of his/her positively-rated items.
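The conversion to implicit feedback described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing script: `to_implicit` is a hypothetical helper, a rating is assumed to be positive when it is greater than or equal to the threshold, "unwatched" is taken to mean items the user never interacted with, and `threshold=None` models the datasets with no threshold.

```python
import random

def to_implicit(ratings, all_items, threshold, rng):
    """ratings: list of (user, item, rating) tuples.
    Returns (user, item, label) triples with label 1 for positives
    and label 0 for sampled unwatched items."""
    positives, seen = {}, {}
    for user, item, rating in ratings:
        seen.setdefault(user, set()).add(item)
        if threshold is None or rating >= threshold:
            positives.setdefault(user, set()).add(item)

    samples = []
    for user, pos in positives.items():
        # Candidates are items the user never interacted with.
        candidates = sorted(all_items - seen[user])
        n_neg = min(len(pos), len(candidates))
        negatives = rng.sample(candidates, n_neg)
        samples += [(user, item, 1) for item in sorted(pos)]
        samples += [(user, item, 0) for item in negatives]
    return samples

# Toy explicit ratings (assumed data).
toy_ratings = [("u1", "a", 5), ("u1", "b", 3), ("u1", "c", 4), ("u2", "a", 2)]
toy_items = {"a", "b", "c", "d", "e", "f"}

implicit = to_implicit(toy_ratings, toy_items, threshold=4, rng=random.Random(0))
```

With a threshold of 4, user `u1` keeps two positives and receives two sampled negatives, while `u2` has no positive rating and therefore contributes no rows.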
We use Microsoft Satori to construct the KGs for the MovieLens-20M, Book-Crossing, and Last.FM datasets. In each triple in the Satori KG, the head and tail are either IDs or textual content, and the relation has the form “domain.head_category.tail_category” (e.g., “book.book.author”). We first select a subset of triples from the whole Satori KG with a confidence level greater than 0.9. Given the sub-KG, we collect the Satori IDs of all valid movies/books/musicians by matching their names with the tails of triples (head, film.film.name, tail), (head, book.book.title, tail), or (head, type.object.name, tail), for the three datasets. Items with multiple matched entities or no matched entity are excluded for simplicity. After obtaining the set of item IDs, we match these item IDs with the heads of all triples in the Satori sub-KG, and select all well-matched triples as the final KG for each dataset.
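The KG construction steps above can be sketched on toy data. The helper `build_item_kg`, the tuple layout `(head, relation, tail, confidence)`, and the example triples are assumptions made for illustration; Satori itself is not publicly accessible, so no real API is used.

```python
def build_item_kg(triples, item_names, name_relation, min_confidence=0.9):
    """triples: iterable of (head, relation, tail, confidence).
    Returns (item name -> entity ID, list of selected triples)."""
    # Step 1: keep triples with confidence greater than the threshold.
    sub_kg = [(h, r, t) for h, r, t, c in triples if c > min_confidence]

    # Step 2: map each name to the entity IDs carrying that name.
    name_to_ids = {}
    for head, relation, tail in sub_kg:
        if relation == name_relation:
            name_to_ids.setdefault(tail, set()).add(head)

    # Step 3: drop items with no match or with multiple matches.
    item_to_id = {}
    for name in item_names:
        ids = name_to_ids.get(name, set())
        if len(ids) == 1:
            item_to_id[name] = next(iter(ids))

    # Step 4: keep sub-KG triples whose head is a matched item.
    item_ids = set(item_to_id.values())
    kg = [(h, r, t) for h, r, t in sub_kg if h in item_ids]
    return item_to_id, kg

# Toy triples (assumed data): "Crash" is ambiguous, one triple is low-confidence.
toy_triples = [
    ("m1", "film.film.name", "Heat", 0.95),
    ("m2", "film.film.name", "Crash", 0.95),
    ("m3", "film.film.name", "Crash", 0.97),
    ("m1", "film.film.director", "p1", 0.99),
    ("m1", "film.film.genre", "g1", 0.50),
    ("m2", "film.film.director", "p2", 0.99),
]

item_to_id, kg = build_item_kg(toy_triples, ["Heat", "Crash", "Unknown"], "film.film.name")
```

Only "Heat" survives the matching step: "Crash" matches two entities and "Unknown" matches none, so both are excluded along with their triples.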
The Dianping-Food dataset is collected from Dianping.com, a Chinese group buying website hosting consumer reviews of restaurants, similar to Yelp. We select approximately 10 million interactions between users and restaurants in Dianping.com from May 1, 2015 to December 12, 2018. The types of positive interactions include clicking, buying, and adding to favorites, and we sample negative interactions for each user. The KG for Dianping-Food is collected from Meituan Brain, an internal knowledge graph built for dining and entertainment by Meituan-Dianping Group. The types of entities include POI (restaurant), city, first-level and second-level category, star, business area, dish, and tag. The types of relations correspond to the types of entities (e.g., “organization.POI.has_dish”).
In KGNN-LS, we set functions $g$ and $f$ as inner product, and $\sigma$ as ReLU for non-last layers and tanh for the last layer. Note that the number of neighbors of an entity may vary significantly over the KG. To keep the computation efficient, we uniformly sample a fixed-size set of neighbors for each entity instead of using its full set of neighbors. The number of sampled neighbors for each entity is denoted by $S$. Hyper-parameter settings are given in Table 8, which are determined by optimizing on a validation set. The search spaces for hyper-parameters are as follows:
;
;
;
;
;
.
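The fixed-size neighbor sampling mentioned above can be sketched as follows. The helper `sample_neighbors` and the toy adjacency list are hypothetical. How entities with fewer neighbors than the sample size are handled is not stated in the text; sampling with replacement in that case is an assumption made here so that every entity ends up with the same number of neighbors.

```python
import random

def sample_neighbors(adj, n_samples, rng):
    """adj: entity -> list of (relation, neighbor) pairs.
    Returns entity -> list of exactly n_samples sampled pairs."""
    sampled = {}
    for entity, neighbors in adj.items():
        if not neighbors:
            continue  # isolated entity, nothing to sample
        if len(neighbors) >= n_samples:
            # Enough neighbors: uniform sampling without replacement.
            sampled[entity] = rng.sample(neighbors, n_samples)
        else:
            # Too few neighbors (assumption): sample with replacement.
            sampled[entity] = rng.choices(neighbors, k=n_samples)
    return sampled

# Toy KG adjacency (assumed data).
toy_adj = {
    "movie1": [("director", "p1"), ("genre", "g1"), ("actor", "p2"),
               ("actor", "p3"), ("country", "c1")],
    "movie2": [("director", "p1")],
    "movie3": [],
}

fixed = sample_neighbors(toy_adj, n_samples=4, rng=random.Random(0))
```

A fixed neighborhood size lets each layer be computed on dense tensors of a known shape, which is what makes minibatch training straightforward.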
Each dataset is split into training, validation, and test sets. Each experiment is repeated multiple times, and the average performance is reported. All trainable parameters are optimized by the Adam algorithm. The code of KGNN-LS is implemented with Python 3.6, TensorFlow 1.12.0, and NumPy 1.14.3.
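Hyper-parameter | Movie | Book | Music | Restaurant |
---|---|---|---|---|
Sampled neighbors ($S$) | 16 | 8 | 8 | 4 |
Hidden dimension ($d$) | 32 | 64 | 16 | 8 |
GNN layers ($L$) | 1 | 2 | 1 | 2 |
LS weight ($\lambda$) | 1.0 | 0.5 | 0.1 | 0.5 |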