Personalized recommender systems have been widely applied in different application scenarios [8, 10, 9, 22]. The knowledge graph (KG), which includes rich semantic relations between items, has recently been shown to be effective in improving recommendation performance [15]. Essentially, a KG is a heterogeneous network where nodes correspond to entities and edges correspond to relations. The main challenge of incorporating KG for recommendation is how to effectively exploit the relations between entities and the graph structure of KG. In practice, one group of methods imposes a well-designed additive regularization loss term to capture the KG structure [24, 3]. However, they cannot explicitly incorporate the semantic relation information of KG into the recommendation model. Another group of methods focuses on extracting the high-order connectivity information between entities along paths, which are always manually designed or selected based on special criteria [23, 25]. These approaches may rely heavily on domain knowledge. Recently, the rapid development of graph neural networks (GNN) [26] has motivated the application of graph convolutional networks (GCN) [5] and graph attention networks (GAT) [17] in developing end-to-end KG-based recommender systems [19, 21], which can aggregate the context information from the structural neighbors of an entity in KG.
Although GNN-based recommendation methods can automatically capture both the structure and semantic information of KG, they may still have the following deficiencies. Firstly, most GNN-based methods do not model user-specific preferences on entities when aggregating the local graph context (i.e., the first-order neighbors) of an entity in KG. As shown in Figure 1, both users have interactions with the item . However, they may prefer it for different reasons. For example, prefers because of the attribute entity of in KG, while pays more attention to its attribute entity . The methods that ignore this situation are insufficient to model users’ personalized preferences. Secondly, the non-local graph context (i.e., the set of most related high-order neighbors) of an entity in KG is not explicitly captured in existing GNN-based recommendation methods. In KG, some items may have very few neighbors, thus some important entities may not be directly connected to them. For example, in Figure 1, the item has only one entity linked with it, thus the aggregation of local context information for the entity is not enough to represent . Moreover, we can also observe that entity is connected with along many multi-hop paths, which demonstrates the importance of to . Existing GNN-based methods [19, 21] address this limitation by propagating features layer by layer. However, this may weaken the effects of more distant entities or even introduce noise.
To address these issues, we propose a novel recommendation framework, namely Contextualized Graph Attention Network (CGAT), which explicitly exploits both the local and non-local context of an entity in KG, as well as the item context extracted from users’ historical data. The contributions made in this paper are as follows: (1) We propose a user-specific graph attention mechanism to aggregate the local context information in KG for recommendation, based on the intuition that different users may have different preferences on the same entity in KG; (2) We propose to explicitly exploit the non-local context information in KG, by developing a biased random walk sampling process to extract the non-local context of an entity, and employing a recurrent neural network (RNN) to model the dependency between the entity and its non-local context in KG; (3) We develop an item-specific attention mechanism that exploits the context information extracted from a user’s historical behavior data to model her preferences on items; (4) We perform extensive experiments on real datasets to demonstrate the effectiveness of CGAT. Experimental results indicate that CGAT usually outperforms state-of-the-art KG-based recommendation methods.
2 Related Work
KG-based recommendation methods can be categorized into three main groups: regularization-based methods, path-based methods, and GNN-based methods. The regularization-based methods exploit the KG structure by imposing regularization terms into the loss function used to learn entity embeddings. For example, CKE [24] is a representative method, which uses TransR [7] to derive semantic entity representations from the item KG. The KTUP model [3] is proposed to jointly train the personalized recommendation and KG completion tasks, by sharing the item embeddings. The high-order feature interactions between items and entities can be further approximated by a cross&compress unit [20]. These methods are highly flexible. However, they lack an explicit modeling of the semantic relations in KG. The path-based methods exploit various connection patterns between entities. For example, the recent works [23, 13] estimate the meta-path based similarities for recommendation. In [25], matrix factorization and factorization machine techniques are integrated to assemble different meta-path information. To address the limitation of manually designed meta-paths, different selection rules or propagation methods have been proposed . For example, in [16], the length condition is used to extract paths and then a batch of RNNs is applied to aggregate the path information. Besides the length, multi-hop relational paths can also be induced based on item associations [11]. Recently, the GNN-based methods aim to develop end-to-end KG-based recommender systems. For example, the KGNN-LS model [19] employs a trainable function that calculates the relation weights for each user to transform the KG into a user-specific weighted graph, and then applies GCN on this graph to learn item embeddings. In [21], the graph attention mechanism is adopted to aggregate and propagate local neighborhood information of an entity, without considering users’ personalized preferences on entities.
In summary, these GNN-based methods implicitly aggregate the high-order neighborhood information via layer-by-layer propagation, instead of explicitly modeling the dependency between an entity and its high-order neighbors.
3 Contextualized Graph Attention Network
We assume the item KG is available, where denotes the set of entities, denotes the set of relations, and denotes the set of entity-relation-entity triples describing the KG structure. Here , , and denote the head entity, relation, and tail entity of a knowledge triple, respectively. and are used to denote the embedding of the entity and relation , where denotes the dimensionality of latent space. Note that the items are treated as a special type of entities in the KG. In addition, we denote the set of users by , the set of items by , and all the observed user-item interactions by . For each user , we denote the set of items she has interacted with by , and use to denote her embedding. Figure 2 shows the structural details of the proposed CGAT model.
3.1 Exploiting Knowledge Graph Context
CGAT exploits KG context from two aspects: (a) local context information, and (b) non-local context information.
3.1.1 Local Graph Context
The entity corresponding to an item is always linked with many other entities that can enrich its information in KG. To consider users’ personalized preferences on entities, we develop a user-specific graph attention mechanism to aggregate the neighborhood information of an entity in KG. For different users, we compute different attention scores for the same neighborhood entity. The embeddings of neighborhood entities can then be aggregated based on the user-specific attention scores. Here, we denote the local neighbors of an entity by , and define as the local graph context of in KG. Moreover, we also argue that the neighborhood entities may have different impacts if they are connected via different relations. To incorporate relations into the attention mechanism, we first integrate the embedding of a neighborhood entity and the embedding of the corresponding relation by the following linear transformation,
where is the concatenation operation, is the weight matrix. The user-specific attention score that describes the importance of the entity to the entity , for a target user , is defined as follows,
is performed by a single-layer feed forward neural network, which is defined as follows,
where is a non-linear transform of defined as . Here, , , and are the weight matrices and bias vectors respectively. Given the coefficient of each neighboring entity of , we compute the linear combination of their embeddings to obtain the local neighborhood embedding of as follows,
Then, we aggregate the embedding of entity and its local neighborhood embedding to form a local contextual embedding for as follows,
where and are the weight matrix and bias vector of the aggregator.
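The idea of user-specific attention over an entity's local neighbors can be illustrated with a minimal sketch. The exact scoring and aggregation functions of the model are not reproduced here: the element-wise sum used to fuse a neighbor with its relation and the dot-product score against the user embedding are simplifying assumptions (the text describes a learned linear map over their concatenation and a single-layer feed-forward network), and the function name `user_specific_local_embedding` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def user_specific_local_embedding(user, neighbors, relations):
    """Attention-weighted sum of neighbor embeddings for one target user.

    user:      user embedding (list of floats)
    neighbors: embeddings of the entity's first-order neighbors
    relations: embeddings of the relations linking to each neighbor
    """
    # ASSUMPTION: fuse neighbor and relation by element-wise sum
    # (stand-in for the learned linear transformation in the text).
    fused = [[n + r for n, r in zip(nb, rel)]
             for nb, rel in zip(neighbors, relations)]
    # ASSUMPTION: score each fused neighbor by its dot product with the user
    # (stand-in for the single-layer feed-forward scorer in the text).
    alphas = softmax([dot(user, f) for f in fused])
    dim = len(user)
    # Linear combination of neighbor embeddings with user-specific weights.
    local = [sum(a * nb[d] for a, nb in zip(alphas, neighbors))
             for d in range(dim)]
    return alphas, local


neighbors = [[1.0, 0.0], [0.0, 1.0]]
relations = [[0.0, 0.0], [0.0, 0.0]]
alphas_a, _ = user_specific_local_embedding([2.0, 0.0], neighbors, relations)
alphas_b, _ = user_specific_local_embedding([0.0, 2.0], neighbors, relations)
print(alphas_a, alphas_b)
```

In this toy setting the two users attend to different neighbors of the same entity, which is the behavior the user-specific mechanism is meant to capture.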
3.1.2 Non-Local Graph Context
The user-specific graph attention network explicitly aggregates the local neighbor (one-hop) information of a target entity to enrich the representation of the target entity. However, this is not enough to capture the non-local context of an entity in KG, and also has weak representation ability for the nodes which have few connections in KG. To offset this gap, we propose a biased random walk based GRU module to aggregate non-local context information of entities.
The biased random walk sampling (BRWS) procedure is used to extract the non-local context of a target entity . To achieve a wider depth-first search, we repeat the biased random walk from to obtain paths, which have a fixed length . The walk iteratively travels to the neighbors of the current entity with a probability, which is defined as follows,
where is the -th entity of a path, denotes the root entity . To encourage wider search, we empirically set . After obtaining the paths and entities by walk, we sort entities according to their frequency in walks in descending order, and choose a set of top-ranked entities orderly. These entities are defined as the non-local graph context of the entity in KG, and denoted by . In the experiments, we empirically set , and set the parameters , , and to 0.2, 15, and 8, respectively.
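The sampling procedure can be sketched as follows. The frequency ranking and the defaults 0.2, 15, and 8 (bias, number of walks, walk length) follow the text, but the transition rule is an assumption: here a candidate that is the root or one of the root's neighbors gets weight `p`, and any other candidate gets weight `1 - p`, so a small `p` pushes the walk away from the root. The function name, the adjacency-dict format, and the context size `top_m` are hypothetical.

```python
import random
from collections import Counter


def biased_random_walk_context(graph, root, p=0.2, num_walks=15,
                               walk_len=8, top_m=3, seed=0):
    """Return the top_m entities most frequently visited by walks from root.

    graph: dict mapping an entity to the list of its neighbors.
    """
    rng = random.Random(seed)
    near_root = set(graph.get(root, [])) | {root}
    counts = Counter()
    for _ in range(num_walks):
        current = root
        for _ in range(walk_len):
            candidates = graph.get(current, [])
            if not candidates:
                break
            # ASSUMED bias: weight p for staying near the root,
            # weight 1 - p for moving to entities farther away.
            weights = [p if c in near_root else 1.0 - p for c in candidates]
            current = rng.choices(candidates, weights=weights, k=1)[0]
            if current != root:
                counts[current] += 1
    # Sort by visit frequency in descending order and keep the top entities.
    return [entity for entity, _ in counts.most_common(top_m)]


toy_kg = {
    "item": ["a"],
    "a": ["item", "b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b"],
}
print(biased_random_walk_context(toy_kg, "item", top_m=2))
```

Because the walks are independent of the model parameters, this step can be run once before training and its output cached per entity.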
In this work, we employ GRU to model the dependency between an entity and its non-local context , because GRU can yield better performance in processing sequence data (i.e., can be seen as a frequency sequence data). Indeed, the more frequently an entity appears in random walks, the more important it is to the target entity . Based on this intuition, we input into GRU in reverse order, and use the last step output as the embedding of , which is denoted by,
where denotes the reverse set of . Then, we aggregate and to form the non-local contextual embedding for as follows,
Here, we use the same aggregator parameters as in Eq (5). Given the embeddings of local and non-local context of in KG, we apply a gate mechanism to integrate these two embeddings by learning the weights in each dimension as,
where is a learnable vector and denotes the sigmoid function. As items are a special type of entities in KG, we can use Eq. (9) to compute the context embedding of item , considering its local context and non-local context in KG. Then, we concatenate and to obtain the contextualized representation of an item as .
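A dimension-wise gate of this kind can be sketched in a few lines. The form below, a sigmoid of a learnable vector weighting the local embedding and its complement weighting the non-local embedding, is an assumption consistent with the description; which of the two embeddings receives the gate value is not recoverable from the text, and the name `gated_fusion` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(local_emb, nonlocal_emb, gate_params):
    """Mix two embeddings with one learnable gate per dimension.

    ASSUMPTION: out[d] = g[d] * local[d] + (1 - g[d]) * nonlocal[d],
    where g = sigmoid(gate_params).
    """
    gates = [sigmoid(w) for w in gate_params]
    return [g * l + (1.0 - g) * n
            for g, l, n in zip(gates, local_emb, nonlocal_emb)]


# With gate parameters at zero, each dimension is an equal mix.
print(gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]))  # [2.0, 3.0]
```

Under this assumed form, driving the gate to one extreme in every dimension keeps only one of the two contexts.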
3.2 Exploiting Interaction Graph Context
In practice, a user’s historical items are usually used to describe her potential interests [14]. For example, the classical SVD++ model [6] treats a user ’s historical items as the implicit feedback given by , and models the influences of on a target item for recommendation. In a similar spirit, we define as the interaction graph context of user . Then, we develop an item-specific attention mechanism to model the influences of on . The basic assumption is that a user’s historical items may have different importance in estimating her preferences on different candidate items. For each item , its relevance weight with respect to the target item is defined as,
where is a weight vector, is the bias, and are the contextualized representations of items and . Then, we define the embedding of the graph context , with respect to a target item , as follows,
A non-linear transformation, where ReLU is the activation function, is then used to aggregate and to form the contextual embedding for as follows,
where and are the weight matrix and bias vector. We concatenate and to form the contextualized representation for as . The prediction of ’s preference on can be defined as .
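The item-specific attention over a user's history can be sketched as below. The scoring function is an assumption: each historical item is scored by a weight vector applied to its element-wise product with the target item, plus a bias, followed by a softmax over the history. Scoring a user-item pair by the inner product of the two contextualized representations is likewise an assumption, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def item_context_embedding(history, target, w, b=0.0):
    """Weight a user's historical items by relevance to a target item.

    history: contextualized representations of historical items
    target:  contextualized representation of the candidate item
    w, b:    weight vector and bias of the (assumed) scoring function
    """
    # ASSUMED score: w . (h * target) + b, normalized with a softmax.
    scores = [dot(w, [x * y for x, y in zip(h, target)]) + b for h in history]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(target)
    context = [sum(wt * h[d] for wt, h in zip(weights, history))
               for d in range(dim)]
    return weights, context


def predict(user_repr, item_repr):
    # ASSUMPTION: preference score as an inner product.
    return dot(user_repr, item_repr)


weights, context = item_context_embedding(
    history=[[1.0, 0.0], [0.0, 1.0]], target=[1.0, 0.0], w=[1.0, 1.0])
print(weights, context)
```

The same history therefore yields a different context embedding for each candidate item, which is what distinguishes this from a fixed average of historical items.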
3.3 Learning Algorithm
The Bayesian personalized ranking (BPR) optimization criterion [12] is used to learn the model parameters of CGAT. BPR assumes that the interacted items should have higher ranking scores than the un-interacted items for each user. Here, we define the BPR loss as follows,
where is constructed by negative sampling. Empirically, for each , we randomly sample items from in the experiments. As we also need to learn the embedding of entities and relations in KG, we design a regularization loss based on the KG structure. Specifically, for each triple , we first define the following score to describe the distance between the head entity and the tail entity via relation in the latent space,
Then, we define the regularization loss as follows,
where is constructed by randomly sampling an entity from , for each . The motivation is that, in the latent space, the distance between an entity and its directly connected neighbor should be smaller than the distance between and the entity that is not directly connected to , via relation . Then, the model parameters can be learned by solving the following objective function,
where denotes all the parameters of CGAT, and are the regularization parameters. The problem in Eq. (16) is solved by a gradient descent algorithm. The details of the optimization algorithm are summarized in Algorithm 1.
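The structure of the objective can be sketched as follows. The BPR term uses the standard pairwise form, the negative log-sigmoid of the score difference between an interacted and a sampled un-interacted item. The KG term is an assumption: a translation-style squared distance between head plus relation and tail, combined pairwise so that observed triples are closer than corrupted ones. Weighting the KG term and an L2 penalty by two regularization parameters is also assumed, and all names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable ln(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bpr_loss(pos_scores, neg_scores):
    """Standard BPR: interacted items should score above sampled negatives."""
    return -sum(log_sigmoid(p - n) for p, n in zip(pos_scores, neg_scores))


def triple_distance(head, rel, tail):
    # ASSUMED distance: squared L2 norm of (head + relation - tail).
    return sum((h + r - t) ** 2 for h, r, t in zip(head, rel, tail))


def kg_pairwise_loss(pos_dists, neg_dists):
    # ASSUMED form: observed triples should have a smaller distance
    # than triples whose tail was replaced by a random entity.
    return -sum(log_sigmoid(n - p) for p, n in zip(pos_dists, neg_dists))


def total_objective(bpr, kg, params, lambda_kg, lambda_l2):
    # ASSUMED combination: BPR + weighted KG loss + weighted L2 penalty.
    l2 = sum(x * x for x in params)
    return bpr + lambda_kg * kg + lambda_l2 * l2


good = bpr_loss([2.0], [0.0])   # positive ranked above negative
bad = bpr_loss([0.0], [2.0])    # positive ranked below negative
print(good < bad)  # True
```

Setting `lambda_kg` to zero in this sketch removes the KG structure constraint while leaving the ranking loss unchanged.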
In the implementation of CGAT, we randomly sample neighbors from for a target entity , and historical items from for a target user , to compute the attention weights defined in Eq. (2) and Eq. (10) respectively. This trick can help keep the computational pattern of each mini-batch fixed and improve the computation efficiency. Moreover, we also set the size of non-local context to . In model training, and are fixed. Let denote the number of sampled user-item interactions in each batch. The time complexity of the biased random walk sampling procedure is , which can be performed before training. In each iteration, to exploit KG context, the user-specific graph attention mechanism and the GRU module have computational complexity . The complexity of exploiting interaction graph context is . The overall complexity of each mini-batch iteration is , which is linear in all hyper-parameters except for .
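Fixed-size sampling of this kind can be sketched as below. How entities with fewer neighbors than the target size are handled is not stated; sampling with replacement in that case is an assumption, and the function name `sample_fixed_size` is hypothetical.

```python
import random


def sample_fixed_size(items, size, seed=0):
    """Return exactly `size` elements drawn from `items`.

    Samples without replacement when enough items exist.
    ASSUMPTION: samples with replacement otherwise, so every
    entity or user contributes a tensor of the same shape.
    """
    rng = random.Random(seed)
    items = list(items)
    if not items:
        return []
    if len(items) >= size:
        return rng.sample(items, size)
    return [rng.choice(items) for _ in range(size)]


print(sample_fixed_size(["e1", "e2", "e3", "e4", "e5"], 4))
print(sample_fixed_size(["e1"], 4))  # ['e1', 'e1', 'e1', 'e1']
```

Keeping the sample size constant is what allows each mini-batch to be processed with fixed-shape arrays.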
4.1 Experimental Settings
Datasets: The experiments are performed on three public datasets: Last-FM (https://grouplens.org/datasets/hetrec-2011/), Movielens-1M (https://grouplens.org/datasets/movielens/1m/), and Book-Crossing (http://www2.informatik.uni-freiburg.de/cziegler/BX/), respectively denoted by FM, ML, and BC. Following [18, 20, 19], we keep all the ratings on FM and BC datasets as observed implicit feedback, due to data sparsity. For ML dataset, we keep ratings larger than 4 as implicit feedback. The KGs of these datasets are constructed by Microsoft Satori, and are currently publicly available (https://github.com/hwwang55). As introduced in , only the triples from the whole KG with a confidence level greater than 0.9 are retained. The sizes of ML and BC KGs are further reduced by only selecting the triples where the relation name contains "film" and "book", respectively. For these datasets, we match the items and entities in sub-KGs by their names (e.g., head, film.film.name, tail for ML). The items matching no entities or multiple entities are removed. Table 1 summarizes the statistics of these experimental datasets.
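The KG preprocessing described above can be sketched as two small filters. The data layout is hypothetical: triples are assumed to be tuples of (head, relation, tail, confidence), and name triples are assumed to be (entity, relation, name); the function names are illustrative only.

```python
def filter_triples(triples, min_conf=0.9, keyword=None):
    """Keep triples above the confidence threshold whose relation
    name contains the keyword (e.g. "film" for ML, "book" for BC)."""
    kept = []
    for head, relation, tail, conf in triples:
        if conf > min_conf and (keyword is None or keyword in relation):
            kept.append((head, relation, tail))
    return kept


def match_items(item_names, name_triples):
    """Map each item to the single entity sharing its name.

    Items matching no entity or several entities are dropped.
    """
    name_to_entities = {}
    for entity, _relation, name in name_triples:
        name_to_entities.setdefault(name, set()).add(entity)
    matched = {}
    for item_id, name in item_names.items():
        entities = name_to_entities.get(name, set())
        if len(entities) == 1:
            matched[item_id] = next(iter(entities))
    return matched


raw = [
    ("m1", "film.film.name", "Heat", 0.95),
    ("m1", "film.film.director", "p1", 0.80),   # below threshold
    ("m1", "people.person.name", "x", 0.99),    # relation lacks "film"
]
print(filter_triples(raw, keyword="film"))
```

The second helper mirrors the rule that ambiguous or unmatched items are removed rather than resolved heuristically.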
Setup and Metrics: For each dataset, we randomly select 60% of the observed user-item interactions for model training, and choose another 20% of interactions for parameter tuning. The remaining 20% of interactions are used as testing data. The quality of the top-K item recommendation is assessed by three widely used evaluation metrics: Precision@K, Recall@K, and Hit Ratio@K. In the experiments, we set K to 10, 20, and 50. For each metric, we first compute the accuracy for each user on the testing data, and then report the averaged accuracy over all users.
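The three metrics can be sketched with their usual definitions, which the text does not spell out: precision is the fraction of the top-ranked list that is relevant, recall is the fraction of a user's test items that appear in the list, and hit ratio is 1 if at least one test item appears. The function names and the cutoff parameter `k` are illustrative.

```python
def metrics_at_k(ranked_items, relevant_items, k):
    """Precision, recall, and hit ratio of one user's top-k list."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    hit_ratio = 1.0 if hits > 0 else 0.0
    return precision, recall, hit_ratio


def average_metrics(rankings, test_sets, k):
    """Average the per-user metrics over users that have test items."""
    users = [u for u in rankings if test_sets.get(u)]
    if not users:
        return 0.0, 0.0, 0.0
    totals = [0.0, 0.0, 0.0]
    for u in users:
        for i, value in enumerate(metrics_at_k(rankings[u], test_sets[u], k)):
            totals[i] += value
    return tuple(t / len(users) for t in totals)


print(metrics_at_k(["a", "b", "c", "d"], {"b", "x"}, k=2))  # (0.5, 0.5, 1.0)
```

Computing the metrics per user first and averaging afterwards gives every user equal weight regardless of how many test interactions she has.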
Baseline Methods: We compare CGAT with the following models: (1) CFKG [1] integrates the multi-type user behaviors and item KG into a unified graph, and employs TransE [2] to learn entity embeddings; (2) RippleNet [18] exploits KG information by propagating a user’s preferences over the set of entities along paths in KG rooted at her historical items; (3) MKR [20] is a multi-task feature learning approach that uses the KG embedding task to assist the recommendation task; (4) KGNN-LS [19] applies GCN on KG to compute the item embedding by propagating and aggregating the neighborhood information on item KG; (5) KGAT [21] employs a graph attention mechanism on KG to exploit the graph context for recommendation.
Implementation Details: For CGAT, the dimensionality of latent space is chosen from . The number of local neighbors of an entity and the number of a user’s historical items used in model training are selected from . The regularization parameters and are chosen from . The learning rate is chosen from . The hyper-parameters of baseline methods are set following the original papers. For all methods, optimal hyper-parameters are determined by the performance on the validation data. We implement CGAT in PyTorch, and the Adam optimizer is used to learn the model parameters.
4.2 Performance Comparison
Table 2 summarizes the results on different datasets. We make the following observations. On FM and ML datasets, KGAT achieves the best performances among all baselines. On BC dataset, MKR achieves comparable results with KGNN-LS, and outperforms CFKG, RippleNet, and KGAT. The KG and interaction graphs on BC dataset are very sparse. MKR jointly solves the KG embedding and recommendation tasks by learning high-order feature interactions between items and entities. The cross&compress units are effective to transfer knowledge between the user-item interaction graph and KG, thus can help solve the data sparsity problem. Moreover, CGAT usually achieves the best performances on all datasets, in terms of all metrics. In most of the scenarios (i.e., 23 among 27 evaluation metrics), the proposed CGAT method significantly outperforms baseline methods with , using the Wilcoxon signed rank significance test. Over all datasets, on average, CGAT outperforms CFKG, RippleNet, MKR, KGNN-LS, and KGAT by 26.07%, 21.32%, 22.29%, 21.92%, 9.56%, respectively, in terms of HR@20. These results demonstrate the effectiveness of CGAT in exploiting both the KG context and users’ historical interaction context for recommendation.
4.3 Ablation Study
[Table 3: HR@20 of CGAT and its variants; columns: Dataset, CGATw/o L, CGATw/o G, CGATw/o UA, CGAT]
Moreover, we also conduct ablation studies to evaluate the performances of the following CGAT variants: (1) CGATw/o L deletes the local context embedding of item from CGAT and only considers the non-local context embedding as final context embedding, i.e., the coefficient in Eq.(9) is set to ; (2) CGATw/o G removes the non-local context embedding of item from the original model, in contrast to the CGATw/o L model; (3) CGATw/o UA removes the user’s embedding in exploiting the local context information in KG (i.e., removing in Eq. (3)).
Due to space limitation, we only report the recommendation accuracy measured by HR@20. We summarize the results in Table 3, and have the following findings. CGAT consistently outperforms the variants CGATw/o L and CGATw/o G, indicating that both local and non-local context in KG are essential for recommendation. CGAT achieves better performance than CGATw/o UA. This demonstrates that the user-specific graph attention mechanism is more suitable for personalized recommendation than a simple attention mechanism that cannot capture users’ personalized preferences. CGATw/o L is slightly superior to CGATw/o G on ML and BC datasets. This indicates that non-local context information plays a complementary role to the local context information, and sometimes may be more important than local context information in improving the recommendation accuracy.
4.4 Parameter Sensitivity Study
Figure 3 summarizes the performances of CGAT with respect to (w.r.t.) different settings of key parameters. As the number of neighboring entities in KG usually varies for different items, we study how a fixed size of sampled neighbors would affect the performance. From Figure 3(a), we can note that CGAT achieves the best performance when is set to 4, while larger does not help further improve the performance. This optimal setting of is close to the average number of neighbors of an entity in KG, which is 3.31 on FM dataset. Then, we vary the number of a user’s historical items used to represent her potential preferences. As shown in Figure 3(a), the best performance is achieved by setting to 16. When is larger than 16, a further increase of would reduce the performance. Figure 3(b) shows the performance trend of CGAT w.r.t. different settings of . The performances achieved by setting to and are better than that achieved by setting to 0. This observation demonstrates that the KG structure constraint in Eq. (15) can help improve the recommendation accuracy. Moreover, we also study the impacts of the number of sampled paths and the path length in the BRWS module. From Figure 3(c), we can note that the best performance is achieved by setting to 15. This indicates that the most relevant entities in the non-local neighborhood of an entity can be captured by performing random walk sampling 15 times. As shown in Figure 3(d), better performance can be achieved by setting in the range between 4 and 12. A larger value of lengthens training and may sometimes decrease the recommendation performance.
5 Conclusion and Future Work
This paper proposes a novel recommendation model, called Contextualized Graph Attention Network (CGAT), which explicitly exploits both local and non-local context information in KG and the interaction context information given by users’ historical behaviors. Specifically, CGAT aggregates the local context information in KG by a user-specific graph attention mechanism, which captures users’ personalized preferences on entities. To incorporate the non-local context in KG, a biased random walk based sampling process is used to extract important entities for the target entity over the entire KG, and a GRU module is employed to explicitly aggregate these entity embeddings. In addition, CGAT utilizes an item-specific attention mechanism to model the influences between items. The superiority of CGAT has been validated by comparing with state-of-the-art baselines on three datasets. For future work, we intend to develop different aggregation strategies to integrate the context information in KG and interaction graph to improve recommendation accuracy.
References
[1] (2018) Learning heterogeneous knowledge base embeddings for explainable recommendation. Algorithms 11 (9), pp. 137.
[2] (2013) Translating embeddings for modeling multi-relational data. In NIPS’13, pp. 2787–2795.
[3] (2019) Unifying knowledge graph learning and recommendation: towards a better understanding of user preferences. In WWW’19, pp. 151–161.
[4] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[5] (2017) Semi-supervised classification with graph convolutional networks. In ICLR’17.
[6] (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD’08, pp. 426–434.
[7] (2015) Learning entity and relation embeddings for knowledge graph completion. In AAAI’15.
[8] (2014) Exploiting geographical neighborhood characteristics for location recommendation. In CIKM’14, pp. 739–748.
[9] (2018) Dynamic Bayesian logistic matrix factorization for recommendation with implicit feedback. In IJCAI’18, pp. 3463–3469.
[10] (2017) Learning user dependencies for recommendation. In IJCAI’17, pp. 2379–2385.
[11] (2019) Jointly learning explainable rules for recommendation with knowledge graph. In WWW’19, pp. 1210–1221.
[12] (2009) BPR: Bayesian personalized ranking from implicit feedback. In UAI’09, pp. 452–461.
[13] (2015) Semantic path based personalized recommendation on weighted heterogeneous information networks. In CIKM’15, pp. 453–462.
[14] (2014) Collaborative filtering beyond the user-item matrix: a survey of the state of the art and future challenges. ACM Computing Surveys 47 (1), pp. 3.
[15] (2019) Research commentary on recommendations with side information: a survey and research directions. Electronic Commerce Research and Applications 37, pp. 100879.
[16] (2018) Recurrent knowledge graph embedding for effective recommendation. In RecSys’18, pp. 297–305.
[17] (2018) Graph attention networks. In ICLR’18.
[18] (2018) RippleNet: propagating user preferences on the knowledge graph for recommender systems. In CIKM’18, pp. 417–426.
[19] (2019) Knowledge graph convolutional networks for recommender systems with label smoothness regularization. In KDD’19.
[20] (2019) Multi-task feature learning for knowledge graph enhanced recommendation. In WWW’19, pp. 2000–2010.
[21] (2019) KGAT: knowledge graph attention network for recommendation. In KDD’19.
[22] (2019) PD-GAN: adversarial learning for personalized diversity-promoting recommendation. In IJCAI’19, pp. 3870–3876.
[23] (2013) Recommendation in heterogeneous information networks with implicit user feedback. In RecSys’13, pp. 347–350.
[24] (2016) Collaborative knowledge base embedding for recommender systems. In KDD’16, pp. 353–362.
[25] (2017) Meta-graph based recommendation fusion over heterogeneous information networks. In KDD’17, pp. 635–644.
[26] (2018) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434.