Knowledge-Enhanced Top-K Recommendation in Poincaré Ball

01/13/2021 · Chen Ma, et al. · HUAWEI Technologies Co., Ltd. · McGill University

Personalized recommender systems are increasingly important as more content and services become available and users struggle to identify what might interest them. Thanks to their ability to provide rich information, knowledge graphs (KGs) are being incorporated to enhance recommendation performance and interpretability. To make effective use of the knowledge graph, we propose a recommendation model in the hyperbolic space, which facilitates the learning of the hierarchical structure of knowledge graphs. Furthermore, a hyperbolic attention network is employed to determine the relative importance of the neighboring entities of a given item. In addition, we propose an adaptive and fine-grained regularization mechanism to adaptively regularize items and their neighboring representations. Via a comparison with state-of-the-art methods on three real-world datasets, we show that the proposed model outperforms the best existing models by 2-16% in top-K recommendation.


Introduction

With the rapid growth of Internet services and mobile devices, personalized recommender systems play an increasingly important role in modern society. They can reduce information overload and help satisfy diverse service demands. Such systems bring significant benefits to at least two parties. They can: (i) help users easily discover products from millions of candidates, and (ii) create opportunities for product providers to increase revenue.

To provide a more accurate and interpretable recommendation service, knowledge graphs (KGs) are being incorporated into recommender systems. A KG is a heterogeneous graph, where nodes function as entities and edges represent relations between the entities. This is an effective data structure to model relational data, e.g., two movies directed by the same director. Several recent works have integrated KGs into the recommendation model, and the approaches can be divided into two branches: path-based DBLP:conf/kdd/HuSZY18; DBLP:conf/cikm/WangZWZLXG18 and regularization-based DBLP:conf/kdd/ZhangYLXM16; DBLP:conf/kdd/Wang00LC19. Path-based methods extract paths from the KG that carry the high-order connectivity information and feed these paths into the predictive model. To handle the large number of paths between two nodes, researchers have either applied path selection algorithms to select prominent paths or defined meta-path patterns to constrain the paths. By contrast, regularization-based methods devise additional loss terms that capture the KG structure and use these to regularize the recommender model learning.

Although many effective models have been proposed, we argue that there are still several avenues for enhancing performance. First, previous works learn the KG representations in the Euclidean space. As has been observed in other application domains, this may not effectively capture the hierarchical structure that is known to exist within KGs DBLP:conf/icml/SalaSGR18. Second, methods like CKE DBLP:conf/kdd/ZhangYLXM16, CFKG DBLP:journals/algorithms/AiACZ18, and RippleNet DBLP:conf/cikm/WangZWZLXG18 do not distinguish between neighboring entities according to their relative importance and informativeness when learning the representation of each entity. This may blur the information carried by different relations in the KG and lead to an incomplete understanding of an entity. Third, all the regularization-based methods adopt a fixed regularization hyper-parameter. We argue that the regularization degree should be adaptive, taking on different values for different entities according to the relevance and value of the information from the knowledge graph. Furthermore, different training phases may need different magnitudes of regularization, so the hyper-parameter values should evolve during training.

To tackle the aforementioned problems, we propose a knowledge-enhanced recommendation model in the hyperbolic space, namely Hyper-Know, for the top-K recommendation task. In particular, we map the entity and relation embeddings of the KG as well as the user and item embeddings to the Poincaré ball model. This allows us to capture the hierarchical structure in the KG. We incorporate an attention model in the hyperbolic space, and use the Einstein midpoint for aggregation, in order to form a representation of the neighborhood of each item in the knowledge graph. We then use a regularization term to encourage the representation of an item to remain close to the representation of its neighborhood (in the hyperbolic space). This transfers the relational and structural information from the knowledge graph to the recommendation model. To adaptively control the regularization effect, we model the learning of adaptive and fine-grained regularization factors as a bilevel (inner and outer) optimization problem DBLP:journals/tec/SinhaMD18. We build a proxy function to explicitly link the learning of the regularization-related parameters with the outer objective function. We extensively evaluate our model on three real-world datasets, comparing it with many state-of-the-art methods using a variety of performance validation metrics. The experimental results not only demonstrate the improvements of our model over other baselines but also show the effectiveness of the proposed components.

To summarize, the major contributions of this paper are:


  • To model the hierarchical structure of the KG, we map the entity and relation embeddings of the KG into the Poincaré ball along with the user and item embeddings. To the best of our knowledge, ours is the first work to consider knowledge-enhanced recommendation in the hyperbolic space.

  • To transfer the knowledge from the KG to the recommendation model, we incorporate hyperbolic attention and use the Einstein midpoint to aggregate the neighboring entities of an item to form a neighborhood representation.

  • To learn the adaptive regularization factors, we cast the learning process as a bilevel optimization problem and build a proxy function to explicitly update the regularization-related parameters.

  • Experiments on three real-world datasets show that Hyper-Know significantly outperforms the state-of-the-art methods for the top-K recommendation task.

Related Work

General Recommendation

Early recommendation studies largely focused on explicit feedback DBLP:conf/www/SarwarKKR01; DBLP:conf/kdd/Koren08. The recent research focus is shifting towards implicit data DBLP:conf/cikm/TranLL018. Collaborative filtering (CF) with implicit feedback is usually treated as a top-K item recommendation task, where the goal is to recommend to each user a list of items that the user may be interested in. This setting is more practical and challenging DBLP:conf/icdm/PanZCLLSY08, and accords more closely with real-world recommendation scenarios. Early works mostly rely on matrix factorization techniques DBLP:conf/icdm/HuKV08; DBLP:conf/uai/RendleFGS09 to learn latent features of users and items. Due to their ability to learn salient representations, (deep) neural network-based methods DBLP:conf/www/HeLZNHC17; DBLP:conf/icdm/SunZMCGTH19; DBLP:conf/kdd/MaKL19 have also been adopted. Autoencoder-based methods DBLP:conf/wsdm/WuDZE16; DBLP:conf/cikm/MaZWL18; DBLP:conf/wsdm/MaKWWL19 have also been proposed for top-K recommendation. In DBLP:conf/kdd/LianZZCXS18; DBLP:conf/ijcai/XueDZHC17, deep learning techniques are used to boost traditional matrix factorization and factorization machine methods. Recently, some methods have also been developed in the hyperbolic space. HyperML DBLP:conf/wsdm/TranT0CL20 conducts metric learning in the hyperbolic space and outperforms its Euclidean counterparts. DBLP:conf/sigir/FengTCCLL20 propose to tackle the next Point-of-Interest recommendation task in the hyperbolic space.

Knowledge Graph Enhanced Recommendation

Knowledge graphs (KGs) are an important means of representing side information in recommender systems and have proven helpful for improving recommendation performance. For example, DBLP:conf/kdd/ZhangYLXM16 propose to apply the TransR method DBLP:conf/aaai/LinLSLZ15 to learn the KG representation as well as the item embeddings in the KG. DBLP:journals/algorithms/AiACZ18 integrate users and items with the KG and jointly learn the recommendation and KG parts. DBLP:conf/www/WangZZLXG19 propose a multi-task feature learning approach for knowledge graph enhanced recommendation, where the two parts are connected with a cross-and-compress unit to transfer knowledge and share regularization of items. Another line of research performs propagation over the KG to assist recommendation. Specifically, RippleNet DBLP:conf/cikm/WangZWZLXG18 extends a user's interests along KG links via preference propagation, which automatically propagates the user's potential preferences and explores her hierarchical interests in the KG. KPRN DBLP:conf/aaai/WangWX00C19 constructs extracted path sequences from both entity embeddings and relation embeddings; these paths are encoded with an LSTM layer and the preferences for items in each path are predicted through fully-connected layers. KGCN DBLP:conf/www/WangZXLG19 studies the use of Graph Convolutional Networks (GCNs) to compute item embeddings via propagation among their neighbors in the KG. Recently, KGAT DBLP:conf/kdd/Wang00LC19 recursively performs propagation over the KG via a graph attention mechanism that refines entity embeddings. Several subsequent works DBLP:conf/sigir/ChenZMLM20; DBLP:conf/www/WangX000C20 focus on optimizing the negative sampling procedure in knowledge-enhanced recommendation. In this paper, we report results for our proposed method using a vanilla negative sampling strategy, so that we can focus on the performance impact of the novel aspects: learning in the hyperbolic space, using hyperbolic attention with Einstein midpoint aggregation, and introducing adaptive regularization. However, these advanced negative sampling strategies can also be incorporated into our proposed method for a further performance improvement.

Our proposed model distinguishes itself from previous models by learning knowledge-enhanced recommendation in the Poincaré ball model. In addition, we employ a hyperbolic attention model in the hyperbolic space to assign different degrees of importance to the neighboring entities of a certain item. We introduce a bilevel optimization formulation of the learning task to achieve an adaptive regularization mechanism that controls the regularization effect.

Preliminaries

Problem Formulation

The knowledge-based recommendation task considered in this paper takes as inputs the user implicit feedback and the item knowledge graph. The implicit feedback is represented by a set of user-item pairs $(u, i)$ with $u \in \mathcal{U}$ and $i \in \mathcal{I}$, where $\mathcal{U}$ is the user set and $\mathcal{I}$ is the item set. The item knowledge graph can be formulated as a set of triples $(h, r, t)$, each consisting of a relation $r$ and two entities $h$ and $t$, referred to as the head and tail of the triple.

Then the top-$K$ recommendation task in this paper is formulated as: given the training item set $\mathcal{S}_u$ of user $u$ and the non-empty test item set $\mathcal{T}_u$ of user $u$ (requiring that $\mathcal{T}_u \subseteq \mathcal{I}$ and $\mathcal{T}_u \cap \mathcal{S}_u = \emptyset$), the model must recommend an ordered set of items $\mathcal{X}_u$ such that $|\mathcal{X}_u| \le K$ and $\mathcal{X}_u \cap \mathcal{S}_u = \emptyset$. Then the recommendation quality is evaluated by a matching score between $\mathcal{T}_u$ and $\mathcal{X}_u$, such as Recall@$K$.
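To make this setup concrete, the following minimal Python sketch shows one plausible in-memory representation of the two inputs (the variable names and toy values are ours, for illustration only):

```python
from collections import defaultdict

# Implicit feedback: observed (user, item) pairs, u in the user set,
# i in the item set (toy identifiers for illustration).
interactions = {(0, 10), (0, 42), (1, 10), (2, 7)}

# Item knowledge graph: (head entity, relation, tail entity) triples.
kg_triples = {(10, 0, 501), (10, 1, 502), (42, 0, 501)}

# Training item set S_u of each user u, as in the top-K formulation above.
train_items = defaultdict(set)
for u, i in interactions:
    train_items[u].add(i)

# Neighboring triples N_i of each item i (triples where i is the head),
# used later to build the item's neighborhood representation.
neighbors = defaultdict(list)
for h, r, t in kg_triples:
    neighbors[h].append((r, t))
```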

Hyperbolic Geometry of the Poincaré Ball

The Poincaré ball model is one of five isometric models of hyperbolic geometry cannon1997hyperbolic, which is a non-Euclidean geometry with constant negative curvature. Formally, the Poincaré ball of radius 1 is the $d$-dimensional manifold $\mathbb{B}^d = \{\mathbf{x} \in \mathbb{R}^d : \|\mathbf{x}\| < 1\}$ equipped with the Riemannian metric $g^{\mathbb{B}}_{\mathbf{x}}$, which is conformal to the Euclidean metric $g^{E}$ with the conformal factor $\lambda_{\mathbf{x}} = \frac{2}{1 - \|\mathbf{x}\|^2}$, i.e., $g^{\mathbb{B}}_{\mathbf{x}} = \lambda_{\mathbf{x}}^2\, g^{E}$. The distance between two points $\mathbf{x}, \mathbf{y} \in \mathbb{B}^d$ is measured along a geodesic (i.e., a shortest path between the points) and is given by:

$$d_{\mathbb{B}}(\mathbf{x}, \mathbf{y}) = 2 \operatorname{arctanh}\big( \lVert -\mathbf{x} \oplus \mathbf{y} \rVert \big), \qquad (1)$$

where $\lVert \cdot \rVert$ denotes the Euclidean norm and $\oplus$ represents Möbius addition DBLP:conf/nips/GaneaBH18:

$$\mathbf{x} \oplus \mathbf{y} = \frac{\big(1 + 2\langle \mathbf{x}, \mathbf{y} \rangle + \|\mathbf{y}\|^2\big)\, \mathbf{x} + \big(1 - \|\mathbf{x}\|^2\big)\, \mathbf{y}}{1 + 2\langle \mathbf{x}, \mathbf{y} \rangle + \|\mathbf{x}\|^2 \|\mathbf{y}\|^2}. \qquad (2)$$
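For reference, here is a small PyTorch sketch of Eqs. (1) and (2) (our own illustrative implementation, not the authors' released code; the clamping for numerical stability is an added safeguard):

```python
import torch

def mobius_add(x, y, eps=1e-5):
    # Möbius addition (Eq. 2); broadcasts over leading dimensions.
    xy = (x * y).sum(dim=-1, keepdim=True)
    x2 = (x * x).sum(dim=-1, keepdim=True)
    y2 = (y * y).sum(dim=-1, keepdim=True)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    den = 1 + 2 * xy + x2 * y2
    return num / den.clamp_min(eps)

def poincare_dist(x, y, eps=1e-5):
    # Geodesic distance (Eq. 1): d(x, y) = 2 * arctanh(|| -x (+) y ||).
    norm = mobius_add(-x, y).norm(dim=-1)
    return 2 * torch.atanh(norm.clamp(max=1 - eps))
```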

Figure 1: The procedure of the bilevel optimization to control the learning of the fine-grained parameters $\boldsymbol{\lambda}$. (The figure shows the user & item embeddings and the KG control parameters feeding a proxy function, with forward passes and gradient backward passes for the inner and outer losses.)

Methodology

In this section, we introduce the proposed model, Hyper-Know, which integrates the knowledge graph with the recommendation task in the hyperbolic space. We first introduce user preference learning in the hyperbolic space. Then we illustrate the hyperbolic attention mechanism used to distinguish among an item's neighboring entities in the knowledge graph. Next, we explain how the recommendation objective and the knowledge graph regularization are adaptively balanced via a bilevel optimization formulation. Lastly, we describe the training and prediction procedures of the proposed model.

Learning User Preference

User preference modeling lies at the core of recommender systems. Recently, distance metric learning has been widely applied to measure the user preference on items, yielding substantial performance gains DBLP:conf/www/HsiehYCLBE17. In this approach, the distance between user $u$ and item $i$ is used to measure the user's preference for the item. To learn the user preference, we apply the Bayesian Personalized Ranking (BPR) loss DBLP:conf/uai/RendleFGS09 to capture the pairwise preference of a user for an item $i$ that the user has accessed compared to a randomly sampled item $j$:

$$\mathcal{L}_{BPR} = \sum_{(u, i, j)} -\log \sigma\big( d_{\mathbb{B}}(\mathbf{e}_u, \mathbf{e}_j) - d_{\mathbb{B}}(\mathbf{e}_u, \mathbf{e}_i) \big), \qquad (3)$$

where $\mathbf{e}_u, \mathbf{e}_i, \mathbf{e}_j \in \mathbb{B}^d$, $\sigma(\cdot)$ is the sigmoid function, and $d$ is the dimension of the manifold. $\Theta$ represents the parameters of the recommender model.
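A hedged sketch of Eq. (3), reusing poincare_dist from the sketch above; it assumes the embeddings have already been constrained to lie inside the unit ball:

```python
import torch.nn.functional as F

def hyperbolic_bpr_loss(u_emb, pos_emb, neg_emb):
    # Pairwise ranking: push the accessed item i closer to u than the
    # randomly sampled item j, i.e., make d(u, i) < d(u, j).
    pos_d = poincare_dist(u_emb, pos_emb)
    neg_d = poincare_dist(u_emb, neg_emb)
    # -log sigmoid(d(u, j) - d(u, i)), summed over sampled (u, i, j) triples;
    # logsigmoid is the numerically stable form.
    return -F.logsigmoid(neg_d - pos_d).sum()
```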

Regularizing Neighboring Entities

Knowledge graphs (KGs), consisting of (head entity, relationship, tail entity) triples, are efficient data structures for representing factual knowledge and are widely used in applications such as question answering DBLP:conf/aaai/ZhangDKSS18. Recently, KGs have been applied in recommender systems to not only enhance the recommendation performance but also provide interpretable recommendation results.

To effectively exploit KGs in recommender systems, we treat them as relational inductive biases DBLP:journals/corr/abs-1806-01261 between items. During the learning process, the relations in the KG can be used as regularizers: if two items link to one or more common entities in the KG, this suggests that a user may have similar preferences for the two items. However, an item can link to multiple entities in the KG, and the relative importance of different entities can differ greatly. Moreover, the entities can contribute in different ways to the description of the item. This motivates us to propose an attention mechanism in the Poincaré ball model.

Considering an item $i$, we use $\mathcal{N}_i$ to denote the set of neighboring triples $(i, r, e)$ for which $i$ is the head entity. Then we apply a TransE-style DBLP:conf/nips/BordesUGWY13 scoring function to calculate the matching score between the item and a neighboring entity $e$ in $\mathcal{N}_i$:

$$a(i, r, e) = -d_{\mathbb{B}}\big( \mathbf{e}_i \oplus \mathbf{e}_r,\; \mathbf{e}_e \big), \qquad (4)$$

and the scores over $\mathcal{N}_i$ are normalized with a softmax to obtain attention weights $\alpha_{(i, r, e)}$.

The usual way to aggregate multiple attended vectors in the Euclidean space is a weighted midpoint (weighted average). The corresponding operation in the hyperbolic space is not immediately obvious, but fortunately the extension does exist in the form of the Einstein midpoint, which has a simple form in the Klein disk model cannon1997hyperbolic:

$$\mathbf{t}_i = \sum_{(r, e) \in \mathcal{N}_i} \frac{\alpha_{(i, r, e)}\, \gamma_{e}}{\sum_{(r', e') \in \mathcal{N}_i} \alpha_{(i, r', e')}\, \gamma_{e'}}\; \mathbf{e}^{\mathbb{K}}_{e}, \qquad \gamma_{e} = \frac{1}{\sqrt{1 - \|\mathbf{e}^{\mathbb{K}}_{e}\|^2}}, \qquad (5)$$

where the $\gamma_{e}$ are the Lorentz factors and $\mathbf{e}^{\mathbb{K}}_{e}$ denotes the coordinates of entity $e$ after transforming from the Poincaré ball model to the Klein disk model. The Klein model is supported on the same space as the Poincaré ball, but the same point has different coordinates in each model. Let $\mathbf{x}_{\mathbb{B}}$ and $\mathbf{x}_{\mathbb{K}}$ denote the coordinates of the same point in the Poincaré and Klein models, respectively. Then the following transition formulas hold:

$$\mathbf{x}_{\mathbb{K}} = \frac{2\, \mathbf{x}_{\mathbb{B}}}{1 + \|\mathbf{x}_{\mathbb{B}}\|^2}, \qquad \mathbf{x}_{\mathbb{B}} = \frac{\mathbf{x}_{\mathbb{K}}}{1 + \sqrt{1 - \|\mathbf{x}_{\mathbb{K}}\|^2}}. \qquad (6)$$
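The following sketch puts Eqs. (4)-(6) together for one item (our reading of the aggregation; the softmax normalization of the TransE-style scores and the function names are our assumptions):

```python
import torch

def poincare_to_klein(x):
    # Eq. 6 (left): x_K = 2 x_B / (1 + ||x_B||^2).
    return 2 * x / (1 + (x * x).sum(dim=-1, keepdim=True))

def klein_to_poincare(x):
    # Eq. 6 (right): x_B = x_K / (1 + sqrt(1 - ||x_K||^2)).
    x2 = (x * x).sum(dim=-1, keepdim=True)
    return x / (1 + torch.sqrt((1 - x2).clamp_min(0)))

def neighborhood_repr(item_emb, rel_embs, tail_embs, eps=1e-5):
    # Eq. 4: TransE-style matching scores in the ball, then softmax weights.
    scores = -poincare_dist(mobius_add(item_emb, rel_embs), tail_embs)
    alpha = torch.softmax(scores, dim=0)                              # (n,)
    # Eq. 5: Einstein midpoint, computed in Klein coordinates.
    k = poincare_to_klein(tail_embs)                                  # (n, d)
    gamma = 1 / torch.sqrt((1 - (k * k).sum(-1)).clamp_min(eps))      # Lorentz factors
    w = alpha * gamma
    mid_k = (w.unsqueeze(-1) * k).sum(0) / w.sum().clamp_min(eps)
    return klein_to_poincare(mid_k)          # t_i, back in the Poincaré ball
```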

We call $\mathbf{t}_i$ in (5) the neighborhood representation of item $i$ (mapped back to the Poincaré ball via (6)). During the training process we add a regularization term that encourages the neighborhood representation to be close to the item's representation $\mathbf{e}_i$. The goal is to transfer the inductive bias in the KG to the item representation:

$$\mathcal{L}_{KG} = \sum_{i \in \mathcal{I}} d_{\mathbb{B}}(\mathbf{t}_i, \mathbf{e}_i). \qquad (7)$$

Combining this with the user preference learning objective $\mathcal{L}_{BPR}$, the overall knowledge-enhanced objective is:

$$\mathcal{L} = \mathcal{L}_{BPR} + \lambda\, \mathcal{L}_{KG}, \qquad (8)$$

where $\lambda$ balances the effect of the KG.

Adaptive and Fine-grained Regularization

Previous works DBLP:conf/kdd/Wang00LC19; DBLP:conf/cikm/WangZWZLXG18; DBLP:conf/kdd/ZhangYLXM16 that derive information from a KG in the recommender setting use a single, fixed value for $\lambda$ in Eq. 8 when training the overall objective. However, employing a single fixed value for $\lambda$ has several drawbacks. First, different datasets may require different levels of regularization from the KG; treating $\lambda$ as a fixed value requires an extra hyper-parameter search for each dataset to realize the full benefit of the KG. Second, different items may need different degrees of regularization; using the same value for every item limits the performance improvement that can be derived from the KG information. Third, in different training phases, the model may need different magnitudes of regularization.

To address the problems outlined above, we propose an adaptive regularization scheme that applies a different strength of regularization to each item and adjusts the strength throughout training. We reformulate Eq. 8 as:

$$\min_{\Theta}\; \mathcal{L}_{BPR}(\Theta) + \sum_{i \in \mathcal{I}} \sigma(\lambda_i)\, d_{\mathbb{B}}(\mathbf{t}_i, \mathbf{e}_i), \qquad (9)$$

where $\lambda_i$ is the $i$-th entry of $\boldsymbol{\lambda}$ and $\sigma(\cdot)$ is the sigmoid function. Unfortunately, directly minimizing this objective function cannot achieve the desired purpose of adaptively controlling the regularization. The reason is that, since $\boldsymbol{\lambda}$ explicitly appears in the loss function, constantly decreasing the value of every $\lambda_i$ is the most direct way to minimize the loss. As a consequence, instead of reaching values that are optimal for the model, all $\lambda_i$ will end up very close to zero, leading to unsatisfactory results.

To tackle the above problem, we model the learning of the recommendation model and the adaptive regularization of the KG as a bilevel optimization problem DBLP:journals/anor/ColsonMS07:

$$\min_{\boldsymbol{\lambda}}\; \mathcal{L}_{outer}\big( \Theta^{*}(\boldsymbol{\lambda}) \big) \quad \text{s.t.} \quad \Theta^{*}(\boldsymbol{\lambda}) = \arg\min_{\Theta}\; \mathcal{L}_{inner}(\Theta, \boldsymbol{\lambda}), \qquad (10)$$

where the inner objective $\mathcal{L}_{inner}$ is the regularized loss in Eq. 9. Here $\Theta$ contains the model parameters, i.e., the user, item, and KG embeddings. The inner objective minimizes $\mathcal{L}_{inner}$ with respect to $\Theta$ with $\boldsymbol{\lambda}$ fixed. Meanwhile, the outer objective optimizes $\boldsymbol{\lambda}$ through $\mathcal{L}_{outer}$, considering $\Theta$ as a function of $\boldsymbol{\lambda}$.

Initialize optimizers $\mathrm{opt}_{\Theta}$ and $\mathrm{opt}_{\boldsymbol{\lambda}}$;
while not converged do
       Update $\Theta$ (fix $\boldsymbol{\lambda}$):
       $\Theta \leftarrow \mathrm{opt}_{\Theta}\big(\Theta,\, \nabla_{\Theta}\, \mathcal{L}_{inner}(\Theta, \boldsymbol{\lambda})\big)$;
       Proxy:
       $\widetilde{\Theta} \leftarrow \Theta - \eta\, \nabla_{\Theta}\, \mathcal{L}_{inner}(\Theta, \boldsymbol{\lambda})$;
       Update $\boldsymbol{\lambda}$ (fix $\Theta$):
       $\boldsymbol{\lambda} \leftarrow \mathrm{opt}_{\boldsymbol{\lambda}}\big(\boldsymbol{\lambda},\, \nabla_{\boldsymbol{\lambda}}\, \mathcal{L}_{outer}(\widetilde{\Theta})\big)$;
end while
Algorithm 1 Iterative Training Procedure

As most existing models use gradient-based methods for optimization, a simple approximation strategy with less computation is introduced as follows:

$$\Theta^{*}(\boldsymbol{\lambda}) \approx \widetilde{\Theta} = \Theta - \eta\, \nabla_{\Theta}\, \mathcal{L}_{inner}(\Theta, \boldsymbol{\lambda}). \qquad (11)$$

In this expression, $\eta$ is the learning rate for one step of inner optimization. Related approximations have been validated in DBLP:conf/wsdm/Rendle12; DBLP:conf/iclr/LiuSY19; DBLP:conf/kdd/MaMZTLC20. Thus, we can define a proxy function to link $\boldsymbol{\lambda}$ with the outer optimization:

$$\mathcal{L}_{proxy}(\boldsymbol{\lambda}) = \mathcal{L}_{outer}\big( \Theta - \eta\, \nabla_{\Theta}\, \mathcal{L}_{inner}(\Theta, \boldsymbol{\lambda}) \big). \qquad (12)$$

For simplicity, we use two optimizers $\mathrm{opt}_{\Theta}$ and $\mathrm{opt}_{\boldsymbol{\lambda}}$ to update $\Theta$ and $\boldsymbol{\lambda}$, respectively. The iterative procedure is shown in Alg. 1.
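To make Alg. 1 and Eqs. (11)-(12) concrete, here is a hedged PyTorch sketch of one training iteration (the function names and the exact data fed to the inner and outer losses are our assumptions, not the authors' released code):

```python
import torch

def bilevel_step(theta, lam, inner_loss_fn, outer_loss_fn,
                 opt_theta, opt_lam, eta=0.01):
    # theta: list of parameter tensors; lam: the per-item lambda tensor.
    # Update Θ (fix λ): one gradient step on the inner objective (Eq. 9).
    opt_theta.zero_grad()
    inner_loss_fn(theta, lam.detach()).backward()
    opt_theta.step()

    # Proxy (Eqs. 11-12): a one-step look-ahead of Θ that keeps the
    # computation graph alive, so gradients can flow back into λ.
    inner = inner_loss_fn(theta, lam)
    grads = torch.autograd.grad(inner, theta, create_graph=True)
    theta_proxy = [t - eta * g for t, g in zip(theta, grads)]

    # Update λ (fix Θ): differentiate the outer loss through the proxy.
    opt_lam.zero_grad()
    outer_loss_fn(theta_proxy).backward()
    opt_lam.step()
```

In this sketch, both optimizers could be Adam instances; the key design point is that λ never receives a gradient from the inner loss directly, only through the look-ahead parameters, which avoids the collapse of all λ_i to zero described above.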

Training and Prediction

After incorporating a parameter regularization term to avoid overfitting, the overall loss function is:

$$\mathcal{L} = \mathcal{L}_{inner}(\Theta, \boldsymbol{\lambda}) + \mu\, \|\Theta\|_2^2, \qquad (13)$$

where $\mu$ is a hyper-parameter. When minimizing the objective function, the partial derivatives with respect to all the parameters can be computed by gradient descent with back-propagation. We apply the Adam DBLP:journals/corr/KingmaB14 algorithm to automatically adapt the learning rate during the learning procedure.

Recommendation Phase. For user $u$, we compute the distance between the user and each item in the dataset. The items that are not in the training set and have the shortest distances to the user are then recommended to user $u$.
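A minimal sketch of this ranking step, reusing poincare_dist from above and assuming embeddings that already lie in the ball (names are illustrative):

```python
import torch

def recommend_top_k(user_emb, item_embs, seen_items, k=20):
    # Rank all items by Poincaré distance to the user (closest first)
    # and keep the k nearest items not seen during training.
    dists = poincare_dist(user_emb.unsqueeze(0), item_embs)  # (num_items,)
    ranked = torch.argsort(dists).tolist()
    return [i for i in ranked if i not in seen_items][:k]
```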

Evaluation

In this section, we first describe the experimental set-up. We then report the results of the conducted experiments and demonstrate the effectiveness of the proposed modules.

Datasets

The proposed model is evaluated on three real-world datasets from various domains with different sparsities: Amazon-book, Last-FM, and Yelp2018, which are fully adopted from DBLP:conf/kdd/Wang00LC19. The Amazon-book dataset is taken from the book category of the Amazon review dataset DBLP:conf/www/HeM16, which covers a large amount of user-item interaction data, e.g., user ratings and reviews. The Last-FM dataset is collected from the last.fm music website, where the tracks are viewed as the items; a subset of data from Jan. 2015 to Jun. 2015 is selected. The Yelp2018 dataset is adopted from the 2018 edition of the Yelp challenge, where local businesses such as restaurants and bars are viewed as the items.

All the above datasets follow the 10-core setting to ensure that each user and item have at least ten interactions. For Amazon-book and Last-FM, items are mapped into Freebase entities via title matching if there is a mapping available. For Yelp2018, the item knowledge from the local business information network (e.g., category, location, and attribute) is extracted as KG data. The data statistics after preprocessing are shown in Table 1.

For a fair comparison, the three datasets in our experiments are exactly the same as those used in DBLP:conf/kdd/Wang00LC19. For each dataset, 80% of each user's interaction data is randomly selected to constitute the training set, and we treat the remaining 20% as the test set. From the training set, 10% of the interactions are randomly selected as the validation set to tune hyper-parameters. The experiments are executed five times and the average result is reported.

Amazon-book Last-FM Yelp2018
#Users 70,679 23,566 45,919
#Items 24,915 48,123 45,538
#Interactions 847,733 3,034,796 1,185,068
#Entities 88,572 58,266 90,961
#Relations 39 9 42
#Triplets 2,557,746 464,567 1,853,704
Table 1: The statistics of the datasets.
FM NFM CKE CFKG RippleNet GC-MC KGAT Hyper-Know Improv.
Recall@20
Amazon-book 0.1345 0.1366 0.1343 0.1142 0.1336 0.1316 0.1489 0.1534* 3.23%
Last-FM 0.0778 0.0829 0.0736 0.0723 0.0791 0.0818 0.0870 0.0949* 9.08%
Yelp2018 0.0627 0.0660 0.0657 0.0522 0.0664 0.0659 0.0712 0.0683 N/A
NDCG@20
Amazon-book 0.0886 0.0913 0.0885 0.0770 0.0910 0.0874 0.1006 0.1075* 6.86%
Last-FM 0.1181 0.1214 0.1184 0.1143 0.1238 0.1253 0.1325 0.1533* 16.70%
Yelp2018 0.0768 0.0810 0.0805 0.0644 0.0822 0.0790 0.0867 0.0897* 3.46%
Table 2: The performance comparison of all methods in terms of Recall@20 and NDCG@20. The best performing method is boldfaced; the underlined number is the second best performing method. * indicates statistical significance compared to the best baseline method based on the paired t-test.

Evaluation Metrics

We evaluate all the methods in terms of Recall@K and NDCG@K. For each user, Recall@K (R@K) indicates what percentage of her rated items appear in the top-K recommended items. NDCG@K (N@K) is the normalized discounted cumulative gain at K, which takes the positions of correctly recommended items into account.
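For reference, a straightforward implementation of the two metrics for a single user (our sketch; edge cases are handled naively):

```python
import math

def recall_at_k(recs, test_items, k=20):
    # Fraction of the user's held-out items appearing in the top-k list.
    return len(set(recs[:k]) & set(test_items)) / len(test_items)

def ndcg_at_k(recs, test_items, k=20):
    # A hit at rank p contributes 1 / log2(p + 2); the sum is normalized
    # by the best achievable ordering (IDCG).
    test = set(test_items)
    dcg = sum(1.0 / math.log2(p + 2)
              for p, i in enumerate(recs[:k]) if i in test)
    idcg = sum(1.0 / math.log2(p + 2)
               for p in range(min(k, len(test))))
    return dcg / idcg
```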

Methods Studied

To demonstrate the effectiveness of our model, we compare to the following recommendation methods:

  • FM DBLP:conf/icdm/Rendle10, a classical factorization model, which incorporates the second-order feature interactions between input features.

  • NFM DBLP:conf/sigir/0001C17, a state-of-the-art factorization model, which subsumes FM under a neural network.

  • CKE DBLP:conf/kdd/ZhangYLXM16, a representative regularization-based method, which exploits semantic embeddings derived from TransR DBLP:conf/aaai/LinLSLZ15 to enhance the matrix factorization.

  • CFKG DBLP:journals/algorithms/AiACZ18, a model that applies TransE DBLP:conf/nips/BordesUGWY13 on the unified graph including users, items, entities, and relations, casting the recommendation task as the prediction of (u, Interact, i) triplets.

  • RippleNet DBLP:conf/cikm/WangZWZLXG18, a model that combines regularization- and path-based methods, enriching user representations with those of items reached along paths rooted at each user.

  • GC-MC DBLP:journals/corr/BergKW17, a model designed to employ a graph convolutional network on graph-structured data. Here the model is applied on the user-item knowledge graph.

  • KGAT DBLP:conf/kdd/Wang00LC19, a state-of-the-art KG enhanced model, which employs a graph neural network and an attention mechanism to learn from high-order graph-structured data for recommendation.

  • Hyper-Know, the proposed model, which learns knowledge-enhanced recommendation in the Poincaré ball, applying hyperbolic attention to distinguish neighboring entities and bilevel optimization for adaptive regularization.

Experiment Settings

In the experiments, the latent dimension of all the models is set to 64. The parameters for all baseline methods are initialized as in the corresponding papers, and are then carefully tuned to achieve optimal performance. The learning rate and the coefficient of L2 normalization are tuned over sets of candidate values. To prevent overfitting, the dropout ratio is tuned for NFM, GC-MC, and KGAT, and the dimension of the attention network is tested over several values. Regarding NFM, the number of MLP layers and neurons is set according to the original paper. For RippleNet, the number of hops and the memory size follow the original paper, as do the depth and hidden dimensions for KGAT. The network architectures of the above methods are configured to be the same as described in the original papers. For Hyper-Know, the curvature is set to 1 and the batch size is held fixed. The hyper-parameters are tuned on the validation set. Our experiments are conducted with PyTorch running on GPU machines (NVIDIA Tesla V100).

Performance Comparison

The performance comparison results are shown in Table 2.

Observations about our model. First, the proposed model, Hyper-Know, achieves the best performance for most evaluation metrics on the three datasets, which illustrates the superiority of our model. Second, Hyper-Know outperforms KGAT on the Amazon-book and Last-FM datasets. Although KGAT adopts an attention model to distinguish entity importance in the knowledge graph, it may not effectively capture the hierarchical structure between entities, which can be well modeled by learning the entity and relation embeddings in the hyperbolic space. One possible reason why Hyper-Know does not outperform KGAT on the Recall@20 metric on the Yelp2018 dataset is that most of the entities in this KG are linked according to whether they share the same attributes, such as HasTV. Most of these attributes are very generic, which means that the KG provides information of limited value. As a result, much of the transfer that Hyper-Know performs from the KG to the recommendation part for the Yelp2018 dataset is likely to be noise. Third, Hyper-Know achieves better performance than GC-MC and RippleNet. Although GC-MC and RippleNet can model high-order connectivities, they fail to identify the important entities that would make a difference in recommendation. On the other hand, Hyper-Know employs an attention model in the hyperbolic space to learn the neighborhood representation of an item and transfers the knowledge from the KG to the item representation via regularization. Fourth, Hyper-Know obtains better results than CKE. One possible reason is that CKE adopts a fixed regularization strength during the whole training process, whereas Hyper-Know performs fine-grained regularization of each item and its neighborhood. Fifth, Hyper-Know outperforms FM and NFM. One reason may be that using a distance as the scoring function captures more fine-grained user preferences.

Other observations. First, KGAT outperforms GC-MC and RippleNet. KGAT is capable of exploring the high-order connectivity in an explicit way and applies a graph attention model to aggregate the neighbors in the user-item knowledge graph in a weighted manner. Second, FM and NFM achieve better performance than CFKG and CKE in most cases. One major reason is that FM and NFM capture the second-order connectivity between users and entities, whereas CFKG and CKE model connectivity at the granularity of triples, leaving high-order connectivity untouched. Third, RippleNet achieves better performance than FM. This may verify that incorporating two-hop neighboring items is important for enriching user representations. Fourth, NFM performs better than FM. One major reason is that NFM has stronger expressiveness, since the hidden layer allows NFM to capture the nonlinear and complex feature interactions between user, item, and entity embeddings.

Architecture Amazon-book Last-FM
R@20 N@20 R@20 N@20
(1) BPR+E 0.1017 0.0729 0.0604 0.1112
(2) BPR+H 0.1167 0.0833 0.0656 0.1191
(3) BPR+Att+E 0.1121 0.0812 0.0746 0.1319
(4) BPR+Att+H 0.1447 0.1025 0.0885 0.1453
(5) BPR+Avg+H 0.1250 0.0897 0.0775 0.1358
(6) Hyper-Know 0.1534 0.1075 0.0949 0.1533
Table 3: The ablation analysis. Att denotes the attention model, Avg denotes the embedding average operation, E denotes the Euclidean space, and H denotes the hyperbolic space.

Ablation Analysis

To verify the effectiveness of the proposed model in the Poincaré ball, the hyperbolic attention model, and the adaptive regularization mechanism, we conduct an ablation study, reported in Table 3, that demonstrates the contribution of each module to the Hyper-Know model. In (1), we use the Euclidean distance to measure the user preference, optimized by the BPR loss. In (2), we apply the distance in the Poincaré ball to measure users' preferences and optimize using Eq. 3. In (3), we integrate the TransE-style attention on top of (1) in the Euclidean space. In (4), we add hyperbolic attention to (2). In (5), we replace the attention model in (4) with an average operation in the hyperbolic space. In (6), we present the overall Hyper-Know model to show the effectiveness of the adaptive regularization mechanism.

From the results shown in Table 3, we make the following observations. First, comparing (1) and (2), we observe that measuring the user preference by distance in the hyperbolic space achieves better performance than in the Euclidean space. This confirms the results reported in DBLP:conf/wsdm/TranT0CL20. Second, from (2) and (4), we observe that incorporating the hyperbolic attention model significantly improves the model performance. Third, in (3) and (4), we compare the performance of the attention model in the Euclidean and hyperbolic spaces; the attention model achieves better results in the hyperbolic space. Fourth, from (1), (2), (3), and (4), we observe that equipping the recommendation model with the KG in either the Euclidean or the hyperbolic space improves the recommendation performance. Fifth, from (4) and (5), we observe that distinguishing the importance of each neighbor of an item through attention achieves a considerable improvement over a simple average. Comparing (4) and (6), we observe that the adaptive regularization provides fine-grained control of the regularization strength, yielding a further improvement.

CKE CFKG KGAT Hyper-Know
Amazon-book 55s 22s 457s 15s
Last-FM 53s 27s 137s 22s
Yelp2018 63s 37s 352s 20s
Table 4: Training time comparison.

Training Efficiency

In this section, we compare the training efficiency of Hyper-Know with other state-of-the-art KG-enhanced methods in terms of training speed, measuring the time taken for one epoch of training. From the results reported in DBLP:conf/sigir/ChenZMLM20, the compared methods take a similar number of epochs to converge as our proposed method. RippleNet is not computationally efficient and takes much longer to train, so we omit the comparison with RippleNet. All the experiments are conducted on a single NVIDIA Tesla V100 GPU. All the compared methods are executed for 20 epochs and we report the average computation time per epoch in Table 4. The comparison shows that Hyper-Know is more computationally efficient than the other state-of-the-art methods, for the following reasons. Compared to CKE, Hyper-Know has fewer learnable parameters (8.3 million vs. 11.4 million on the Last-FM dataset). Compared to KGAT and CFKG, Hyper-Know does not incorporate users into the KG, which makes the scale of the KG much smaller.

Figure 2: The variation of $\lambda$ (a) on Amazon-book and (b) on Last-FM.

Influence of Hyper-parameters

The value of $\lambda$, which controls the regularization of the item embedding towards its neighborhood representation, is an important hyper-parameter when the adaptive regularization mechanism is not used. Its effect on the Amazon-book and Last-FM datasets is shown in Figure 2.

From the results in Figure 2, we observe that the value of $\lambda$ does affect the recommendation performance, with performance deteriorating by as much as 10 percent if a suboptimal value is chosen. Furthermore, no fixed value achieves performance as good as that obtained with the proposed adaptive mechanism. These results demonstrate that fine-grained and adaptive regularization benefits the recommendation task, which confirms the results reported in DBLP:conf/wsdm/Rendle12.

Figure 3: The embedding visualization of selected entities: (a) Entity-48130 and (b) Entity-97468.

Embedding Visualization

To verify whether the embeddings learned in the Poincaré ball capture the hierarchical structure of the knowledge graph, we train Hyper-Know with a 2-dimensional embedding space on the Last-FM dataset and visualize entities in the 2D hyperbolic space. We randomly select two nodes and their two-hop neighbors to visualize, as shown in Figure 3. The largest dot denotes the selected entity, the medium-sized dots denote its first-hop neighbors, and the smallest dots denote its second-hop neighbors.

From Figure 3, we observe that these three kinds of nodes form hierarchical patterns in the Poincaré ball, suggesting that the embeddings learned in the hyperbolic space can represent hierarchical relationships.

Conclusion

In this paper, we propose a knowledge-enhanced recommendation model in the hyperbolic space (Hyper-Know) for top-K recommendation. Hyper-Know learns the user and item embeddings as well as the knowledge graph representation in the Poincaré ball model to capture the hierarchical structure in the knowledge graph. In addition, we incorporate hyperbolic attention to select the most important neighboring entities of each item. To adaptively control the regularization effect, a bilevel optimization mechanism is proposed to generate a fine-grained regularization effect between recommendation and the knowledge graph. Experimental results on three real-world datasets clearly validate the performance advantages of our model over multiple state-of-the-art methods and demonstrate the effectiveness of each of the proposed constituent modules.
