Bayes EMbedding (BEM): Refining Representation by Integrating Knowledge Graphs and Behavior-specific Networks

08/28/2019 ∙ by Yuting Ye, et al. ∙ berkeley college

Low-dimensional embeddings of knowledge graphs and behavior graphs have proved remarkably powerful in a variety of tasks, from predicting unobserved edges between entities to content recommendation. The two types of graphs can contain distinct and complementary information for the same entities/nodes. However, previous works focus on either knowledge graph embedding or behavior graph embedding, while few consider both in a unified way. Here we present BEM, a Bayesian framework that incorporates the information from knowledge graphs and behavior graphs. To be more specific, BEM takes as prior the pre-trained embeddings from the knowledge graph, and integrates them with the pre-trained embeddings from the behavior graphs via a Bayesian generative model. BEM is able to mutually refine the embeddings from both sides while preserving their own topological structures. To show the superiority of our method, we conduct a range of experiments on three benchmark datasets: node classification, link prediction and triplet classification on two small datasets related to Freebase, and item recommendation on a large-scale e-commerce dataset.




1 Introduction

Graphs are ubiquitous in the real world, including social networks [15, 23], physical systems [2, 36], protein-protein interaction networks [11], knowledge graphs [14] and many other areas [20]. There may be different views of the same set of nodes, and thus graphs of different architectures are built. For example, in the e-commerce industry, item-item networks can be constructed from user clicks, purchases, add-to-preferences and add-to-carts respectively: two items are linked if they are clicked (or acted on via one of the other operations) by the same user. A corresponding knowledge graph can be crafted to represent a collection of interlinked descriptions of the items, e.g., color, materials, functions. Throughout this article, we refer to the graph with respect to a certain behavior/context as the behavior graph (BG), in order to distinguish it from the knowledge graph (KG) that consists of structured symbolic knowledge (triplets). (Footnote 1: The concept of BG covers a wide range of conventional graphs and networks: pagelink networks (the link behavior), author-citation networks (the citation behavior), item-item interaction networks (the co-click and co-purchase behaviors, etc.), to name a few.) KG and BG both reflect the interactions between entities/nodes in reality, but they differ in two aspects: 1) the graph structures; 2) the contained information; see Section 2 for a detailed discussion. The connection and the distinction between KG and BG imply that they can be complementary to each other. It is of great interest to integrate these two types of graphs in a unified way.
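The construction rule for an item-item behavior graph (two items are linked if the same user acted on both) can be sketched as follows. This is a minimal illustration only: the function name `build_behavior_graph`, the event format and the use of co-occurrence counts as edge weights are assumptions made here, not details given in the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_behavior_graph(events):
    """Build an undirected item-item graph for ONE behavior (e.g. clicks).

    events: iterable of (user_id, item_id) pairs (hypothetical input format).
    Returns a dict mapping a sorted item pair (a, b) to the number of users
    who acted on both items.
    """
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)

    edge_weight = defaultdict(int)
    for items in items_by_user.values():
        # Every pair of items touched by the same user gets (or strengthens) an edge.
        for a, b in combinations(sorted(items), 2):
            edge_weight[(a, b)] += 1
    return dict(edge_weight)


# Toy click log; user and item names are made up.
clicks = [
    ("u1", "dress"), ("u1", "heels"),
    ("u2", "dress"), ("u2", "heels"), ("u2", "lipstick"),
]
click_graph = build_behavior_graph(clicks)
```

Running the same function on a purchase log instead of a click log would give a second BG over the same items, which is the sense in which one set of nodes can carry several behavior-specific graphs.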

Benefits of the integration of KG and BG. We give three perspectives, with examples from the e-commerce industry, to illustrate the benefits of incorporating KG and BG. First, KG-aided BG can achieve more accurate recommendations. For instance, given a formal dress and high-heel shoes, methods based on BG alone may recommend arbitrary lipsticks. With information from the KG, the system can make a better recommendation of formal lipsticks instead of sweet lipsticks, as the KG has the knowledge that the dress and the shoes are associated with formal occasions. Second, KG-aided BG can do more than BG alone. Suppose a user buys a ticket to Alaska in January; the knowledge “enjoying aurora in Alaska in winter” is then triggered in the KG, so the system can recommend down jackets, outdoor shoes and tripods for viewing the aurora in a freezing environment. Methods using BG-only embeddings can hardly connect the flight ticket to such outfits. Third, novel knowledge can be discovered from BG on top of what is already known. For example, recent clothing fashions can be inferred from frequently co-clicked or co-purchased clothes. Human common sense or expert knowledge can then be used to identify the most likely fashion of this year.


To deal with multiple graphs, a standard practice is to embed the nodes as vectors while simultaneously integrating the information from all the sources [47, 24, 48]. To the best of our knowledge, however, there is no existing method that jointly learns the BG embedding and the KG embedding. As an alternative solution, it is common to take the pre-trained KG/BG embeddings as the input to learn the representation of BG/KG [43, 44, 15]. Alternatively, one can simply learn the embeddings of KG and BG separately, then combine them via an aggregation method, e.g., concatenation or linear combination. For the first strategy, the interaction information contained in the KG/BG embedding can be distorted if it does not agree with that of BG/KG. For the second strategy, the topological structure from either side is either disguised (e.g., concatenating a short embedding with a long embedding) or destroyed (e.g., taking the average of two embeddings of the same length). In this article, we work with the pre-trained BG and KG embeddings as this strategy is widely applicable. Our goal is to integrate BG and KG without losing the topological information from either side.
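A toy calculation can make the concern about simple aggregation concrete. In the sketch below, two items are unrelated in the KG view (cosine similarity 0) and identical in the BG view (cosine similarity 1); the vectors are made-up values chosen for illustration, and cosine similarity stands in for "topological structure", which is an assumption of this example rather than a definition from the paper.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy embeddings for two items (row 0 and row 1).
kg = np.array([[1.0, 0.0], [0.0, 1.0]])  # KG view: the items are unrelated
bg = np.array([[1.0, 0.0], [1.0, 0.0]])  # BG view: the items coincide

concat = np.concatenate([kg, bg], axis=1)  # aggregation by concatenation
average = (kg + bg) / 2.0                  # aggregation by averaging

sim_kg = cosine(kg[0], kg[1])                 # 0.0
sim_bg = cosine(bg[0], bg[1])                 # 1.0
sim_concat = cosine(concat[0], concat[1])     # 0.5
sim_average = cosine(average[0], average[1])  # about 0.707
```

Neither aggregate reproduces the pairwise relation seen in the KG view or in the BG view, so a downstream model sees a blend in which the two original structures can no longer be told apart.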

Contribution. Throughout this paper, we consider only one KG and one BG. We develop a Bayesian framework called BEM (Bayes EMbedding) that refines the KG and BG embeddings in an integrated fashion while preserving and revealing the topological information from the two sources. The key idea behind BEM is that the KG embedding, plus a behavior-specific bias correction term, acts as the prior information for the generation of the BG embedding; see Figure 1 (c). BEM aims to maximize the likelihood under this Bayesian generative model. Our contribution is twofold. From the perspective of modelling, BEM is proposed to bridge KG and BG seamlessly, with the consideration of their respective topological structures. As a framework, BEM is general and flexible in that it can take any pre-trained KG embeddings and any BG embeddings to mutually refine themselves.

The rest of the paper is organized as follows. In Section 2, we discuss the difference and connection between KG and BG. In Section 3, we review works that are related to our method. In Section 4, we present our method BEM. In Section 5, we demonstrate the utility of BEM in three application studies involving two small datasets related to Freebase and a large dataset in e-commerce, and test the BEM-refined embeddings in a variety of downstream tasks. Finally, we conclude with a discussion of the BEM framework and highlight promising directions for future work in Section 6.

Figure 1: (a) Examples of KG and BG; (b) The workflow of BEM: 1) Embed KG/BG; 2) Train BEM with the parameters of the generative model in (c); 3) Feed the original embeddings and trained parameters into BEM for refining; 4) Refined (corrected) KG/BG embeddings. (c) A top-down generative model (Equation (1)) that connects one KG and three BGs with different behaviors a, b, c. First, for each behavior, there exists a behavior-specific correction term that accounts for the associative bias. Then the refined KG embedding is projected into the BG space via a non-linear transformation function. Finally, the BG embeddings are sampled from a distribution given the projected KG embedding. The model is trained to maximize the likelihood of observing the BG embeddings given the KG embeddings.

2 Discussion of KG and BG

Here we discuss the difference and the connection between KG and BG to illustrate three points: 1) KG and BG are different and hard to jointly learn; 2) KG and BG contain complementary (distinct but related) information, and therefore it is promising to get better embeddings by integrating the two types of graphs; 3) KG and BG can be unified from two reasonable perspectives.

Difference between KG and BG. There are two main differences between KG and BG. First, KG encodes entities and their relations in the form of a triplet (h, r, t), where h, t and r are the head entity, the tail entity and their relation. It corresponds to a directed and highly heterogeneous network. In comparison, BG is constructed based on the interplay between the nodes under certain task/behavior-specific contexts. It corresponds to an undirected network with a limited number of edge types (homogeneous or less heterogeneous than KG). Figure 1 (a) shows the difference between KG and BG in terms of the network structure. The distinction in structure makes it difficult to put the two graphs in a single framework for embedding learning. Second, the triplets in KG are extracted from authentic knowledge and experience. Thus, KG is a semantic network reflecting relatively objective facts that can stand the test of time. BG, in contrast, embodies time-varying and behavior-biased links between nodes, which we illustrate with two examples: 1) People may buy sunglasses and swimwear at the same time in summer, but they will rarely purchase these two items together in winter; 2) Two kinds of sunglasses can be viewed (the click behavior) for comparison but they are rarely bought (the purchase behavior) together. The difference in information between KG and BG indicates that they can complement each other.

Connection between KG and BG. Despite the distinction, KG and BG are also closely related, resembling the connection between humans’ knowledge and experience. KG can be regarded as an abstracted graph that reflects the shared properties among multiple BGs. This bottom-up idea (from BGs to KG) implies that it is possible to acquire novel knowledge from all kinds of BGs. Conversely, we can heuristically interpret the connection from the top down, as shown in Figure 1 (c). KG contains the general information of items, e.g., the item properties (color, materials, etc.), the category of the item, and the concepts/scenarios of the category. (Footnote 2: Scenarios are manually crafted to include items that appear together frequently under certain conditions. For example, the sunglasses and the swimwear both belong to the scenario “summer-beach”.) Then, the node of BG can be thought of as being generated by adjusting the associated entity in KG with a behavior-specific correction term. For instance, the cell-phone is conceptually a portable electronic device (KG). It exhibits a variety of properties under different scenarios (BG), e.g., a communication tool when connecting to others, an entertainment platform when playing games, a working/studying tool when looking up information online. The top-down idea indicates that we can use KG information to help the learning of BG.

3 Related Work

In this section, we review work related to our method. To the best of our knowledge, there is no existing method that learns the BG embeddings and the KG embeddings jointly. We first introduce multi-view learning, which is closest to this goal. Then we review alternative methods, followed by classic representation learning methods for conventional graphs and knowledge graphs.

3.1 Multi-view Embedding Learning

In real life, entities may be described by several distinct feature subsets; such data are called multi-view data. For instance, in e-commerce, an item may be associated with different behavior data in different scenarios, such as the data of purchases, clicks, add-to-preferences and add-to-carts. These multi-view data can be learned jointly to get a unified representation for each item. For this purpose, a variety of approaches have been proposed, including co-training, multiple kernel learning, and subspace learning [47, 24, 48]. In particular, many efforts have been made in multi-view network representation learning. Qu et al. [35] combine the embeddings of different network views linearly. Shi et al. [37] propose two characteristics (preservation and collaboration) and obtain node vectors by modeling them simultaneously. Their work is closely related to ours in the sense that it emphasizes the integration of different sources while preserving their own specialties. However, like other multi-view embedding learning methods, it only deals with homogeneous networks. In contrast, our method is designed to combine BG with KG, which differ in data structure and in the information they contain.

3.2 Alternative Ways to Integrate KG and BG

There are alternative approaches to integrating KG and BG. First, the standard practice is to embed one graph into vectors, then take the embeddings as the input of the learning for the other graph. For example, Wu et al. [43] embed sequential texts, then take them as node/entity attributes for knowledge graph learning. Xie et al. [44] learn knowledge graph embeddings by using the embeddings of entity descriptions. The method of Hamilton et al. [15] can take the pre-trained KG embeddings as input to learn BG embeddings as well. However, this line of work tends to focus on the targeted graph (the graph that uses the pre-trained embedding for learning), while the topological structure of the other graph (the graph that generates the pre-trained embedding) may be lost. Even though interaction information between nodes is contained in the pre-trained embeddings, it can be weakened or ignored if it does not agree with the topology of the targeted graph. Second, there is an even simpler strategy to integrate KG and BG, i.e., learning the embeddings of KG and BG separately, then combining them via an aggregation method, e.g., concatenation or linear combination [15]. Nonetheless, the topological structures from both sides are disguised or destroyed by these aggregation methods. Our work falls in the second category, and is designed to solve the above issue: it preserves and reveals the topological information when integrating BG and KG.

3.3 Representation Learning for BG and KG

Here we review methods used to pre-train BG and KG embeddings. One line of work performs graph embedding based on the graph spectrum [3, 40]. Some works use matrix factorization to get node embeddings [45, 8, 46]. Additionally, simple neural networks are used to generate embeddings by making the distribution of the node embeddings close to that obtained from the topological structure [34, 39]. Recently, techniques based on graph neural networks have also been proposed and widely applied [23, 31, 15, 41].

Since KG differs from BG due to the semantic links between entities, the above embedding methods are not applicable to KG. Many efforts have been made to embed the nodes in KG. As a seminal work, TransE [5] learns a low-dimensional vector for every entity and relation in KGs. Later extensions include TransH [42], TransR [26] and STransE [30], which offer greater flexibility.
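For readers unfamiliar with translation-based KG embedding, the TransE idea can be sketched in a few lines: a triplet (head, relation, tail) is considered plausible when the head vector translated by the relation vector lands near the tail vector. The function name `transe_score` and the toy vectors are illustrative choices made here; only the scoring rule itself reflects TransE, and training details such as the margin-based ranking loss are omitted.

```python
import numpy as np


def transe_score(head, relation, tail, norm=1):
    """TransE dissimilarity ||head + relation - tail||; lower means more plausible.

    norm=1 uses the L1 norm and norm=2 the L2 norm.
    """
    return float(np.linalg.norm(head + relation - tail, ord=norm))


# Toy 2-D embeddings (made-up values).
head = np.array([1.0, 0.0])
relation = np.array([0.0, 1.0])
true_tail = np.array([1.0, 1.0])     # exactly head + relation
corrupt_tail = np.array([0.0, 0.0])  # a corrupted (negative) tail

score_true = transe_score(head, relation, true_tail)
score_corrupt = transe_score(head, relation, corrupt_tail)
```

The true triplet scores 0 and the corrupted one scores 2 under the L1 norm, which is the kind of gap that training tries to enforce between observed and corrupted triplets.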

4 Methods

4.1 Notation

We denote and as the KG embedding and the BG embedding with dimension and respectively. For a vector , let be the dimension of , and let be its -th entry. We use for element-wise multiplication, i.e., for two vectors and with length , . Denote as the Kullback-Leibler (KL) divergence between distributions and [6]. Other detailed notations used throughout this section are summarized in Table 1.

Notation Meaning
KG embedding.
BG embedding.
Dimension of /.
The behavior-specific correction term.
The nonlinear transformation that projects the refined (corrected) KG embedding into the BG space.
Projection of the KG embedding onto the behavior space by .
The edge function that characterizes the interaction between entities in the behavior space.
The distribution of .
The distribution of .
The inference network.
/ all the latent variables for /.
Table 1: Notation Table.

4.2 The Generative Model

Section 2 sheds light on the bottom-up and top-down relations between KG and BG. KG is thought of as the abstract representation of an entity, and BG is its realization under certain context. We can view BG as a mix of KG and a context-specific factor (an adjustment term), but usually it only reflects some aspect of KG (i.e., a projection of the mix). Such insights motivate us to connect KG and BG in a generative model as follows.

Throughout this paper, we focus on the case where each entity has one KG embedding and one BG embedding. Mathematically, suppose there are entities and each entity has a KG embedding and a BG embedding . As depicted in Figure 1, and act as priors and observations respectively. We use to model the adjustment effects between and . In other words, acts as a residual to so that is sufficient to determine the marginal distribution of via a projection function . The projection not only reflects the fact that BG characterizes KG partially, but is also technically required to map into the BG space. To be more specific, we assume the joint distribution of hinges on the following three components:

  • “Refined” KG embeddings , where are sampled from the behavior-specific distribution ;

  • The nonlinear transformation that projects the refined KG embedding into the BG space;

  • The distribution of BG embedding .

Then, write the generative model as


Our target is to optimize the following objective function:


However, the objective function Equation (2) under Model (1) is generally intractable. For the sake of computational feasibility, assumptions are needed to simplify the model:

  • To reduce the model complexity, we assume ’s are independent and identically distributed, i.e., , where is shared by all the entities.

  • To retain the interaction information between entities, we come up with an edge function that characterizes the interplay between and . For example, can be the similarity or the vector difference between and . Then, is assumed to be a generative distribution for .

  • To further reduce the model complexity, we assume ’s are sampled i.i.d. from , where is shared for all pairs of .

Then, Model (1) is reduced to


which is visualized as Figure 2 (a). Compared to Model (1), the reduced model has a much smaller model complexity while retaining the interaction information between entities, i.e., preserving the topological structure, which is crucial for all the BG and KG embedding methods [9, 7]. We call this model BEM-P (“P” denotes pairwise interactions). In comparison, we can ignore the interactions for further complexity reduction:


In fact, Model (4) is a special case of Model (3) by letting and assuming . Then it becomes a model with full independence. We call this model BEM-I (“I” denotes vertex independence). Finally, for the sake of simplicity, we denote BEM-O (“O” denotes NULL) as using the original embeddings directly without applying BEM. All these models are summarized in Table 2. In the sequel, we will omit the subscript , and for simplicity if doing so does not cause ambiguity.

Abbreviation                Meaning
BEM-P                       BEM with node interactions; Model (3).
BEM-I                       BEM with full independence; Model (4).
BEM-O                       Without applying BEM.
KG-P, BG-P, KG-I, BG-I      The KG/BG embedding by BEM-P, BEM-I.
KG-O, BG-O                  The original KG/BG embedding (by BEM-O).
concat-X (X ∈ {P, I, O})    The concatenation of KG-X and BG-X.
Table 2: Abbreviations of models and embeddings
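The generative story behind BEM-P (Model (3)) can be sketched as a small sampler. Everything concrete here is an assumption made for illustration: the Gaussian form and scale of the correction term, the fixed observation noise (the paper uses a sample-specific variance), the layer sizes, and all variable names. The sketch only mirrors the three steps described in the text: perturb the KG embedding with a correction term, project it into the BG space with a two-layer MLP, and generate the pairwise edge quantity around the projected edge.

```python
import numpy as np


def two_layer_mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, standing in for the projection into the BG space."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


def translation_edge(a, b):
    """Translation edge function: the vector difference between two nodes."""
    return a - b


def sample_bg_edge(kg_i, kg_j, mlp_params, rng, correction_scale=0.1, noise_scale=0.05):
    """Draw one noisy BG-space edge for the node pair (i, j)."""
    # 1) Correction terms, drawn i.i.d. from a shared Gaussian (assumed form).
    delta_i = rng.normal(0.0, correction_scale, size=kg_i.shape)
    delta_j = rng.normal(0.0, correction_scale, size=kg_j.shape)
    # 2) Project the refined (corrected) KG embeddings into the BG space.
    proj_i = two_layer_mlp(kg_i + delta_i, *mlp_params)
    proj_j = two_layer_mlp(kg_j + delta_j, *mlp_params)
    # 3) The observed edge is the projected edge plus Gaussian noise (assumed fixed scale).
    mean_edge = translation_edge(proj_i, proj_j)
    return mean_edge + rng.normal(0.0, noise_scale, size=mean_edge.shape)


rng = np.random.default_rng(0)
d_kg, d_hidden, d_bg = 4, 8, 3  # toy dimensions
mlp_params = (
    rng.normal(size=(d_kg, d_hidden)), np.zeros(d_hidden),
    rng.normal(size=(d_hidden, d_bg)), np.zeros(d_bg),
)
kg_a, kg_b = rng.normal(size=d_kg), rng.normal(size=d_kg)
edge_sample = sample_bg_edge(kg_a, kg_b, mlp_params, rng)
```

Under the same assumptions, BEM-I would drop the pairing and generate each node's BG embedding directly from its own projected KG embedding.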

4.3 The Inference Model

Given Equation (3), the objective function (2) can be rewritten as


There is a variety of off-the-shelf methods to optimize Equation (5), such as the EM [29] or MCMC [12] algorithm. But these methods usually fail due to intractability or poor scalability. To this end, we resort to variational inference [4], which is very popular for large-scale scenarios or distributions with intractable integrals. Let be a set of all the latent variables for node , and . For example, in the generative model Equation (4), , and . It is easy to derive that


where is called the inference model [22], i.e., an approximated density function to the posterior density of given . is the associated prior density. Formula (6) is also called the variational lower bound or evidence lower bound (ELBO) [17] for . The first term in the ELBO is termed the reconstruction term and measures the goodness of fit, while the second one is a penalty term that measures the distance between the approximate density and the prior density. Then, our goal of maximizing can be relaxed to maximizing the ELBO. It is well known that the naive Monte-Carlo gradient estimator exhibits very high variance and is impractical when is large [33]. Thus we will utilize particular distributions and introduce additional assumptions to further simplify the ELBO.

We assume to be a multivariate normal density. Assume to be a multivariate normal density with mean and variance matrix , where is the sample-specific variance (see Figure 2 (a)). Here, and are assumed to be sampled from a multivariate log-normal distribution. We introduce the latent variable and to account for the nuisance variation induced by sampling (see Section 4.4). Here we choose the multivariate normal/log-normal distribution because it enjoys appealing statistical and computational properties: 1) normal/log-normal random variables are easy to sample; 2) normal/log-normal distributions can be easily reparametrized with only two parameters [22]; 3) there is a closed-form expression for the KL divergence between two normal/log-normal distributions.
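The closed-form KL divergence mentioned in point 3) can be written down directly for normal densities with diagonal covariance. The helper below is a generic textbook formula, not code from the paper, and the argument names are chosen here for illustration. Because the KL divergence is unchanged by applying the same invertible transformation to both distributions, the same expression evaluated on the log-scale means and variances also gives the KL divergence between two log-normal distributions.

```python
import numpy as np


def kl_diag_normal(mean_q, var_q, mean_p, var_p):
    """KL( N(mean_q, diag(var_q)) || N(mean_p, diag(var_p)) ) in closed form."""
    return float(0.5 * np.sum(
        np.log(var_p / var_q)                    # log-determinant ratio
        + (var_q + (mean_q - mean_p) ** 2) / var_p
        - 1.0
    ))


# Identical distributions have zero divergence.
kl_same = kl_diag_normal(np.zeros(2), np.ones(2), np.zeros(2), np.ones(2))

# Shifting one coordinate of the mean by 1 (unit variances) gives KL = 0.5.
kl_shift = kl_diag_normal(np.array([1.0, 0.0]), np.ones(2), np.zeros(2), np.ones(2))
```

Having this term in closed form means that only the reconstruction part of the ELBO needs Monte-Carlo sampling.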

Figure 2: The BEM-P method with the normal/log-normal distributions with the sample-specific variance for . (a) The generative model (3). The shaded circles represent observed/estimated variables. The empty circles represent latent variables. Edges signify conditional dependency (including deterministic mapping). The solid rectangles (“plates”) indicate independent replication while the dashed rectangles indicate replication only. (b) The inference network indexed by . It takes in KG-O and BG-O and outputs the posterior means/variances of the latent variables in (a). (c) The computational pipeline that concatenates the inference model (b) and the generative model (a) to produce refined (corrected) KG/BG embeddings. (d) An illustration of why the translation edge function is equivalent to a similarity function using the inner product or cosine similarity on the sphere.

By introducing the latent variable , the set of latent variables for node becomes and . We then impose two common conditions in the mean-field variational inference [22]:

  • Both and are from the mean-field family. That is

  • and are normal and log-normal densities with a diagonal covariance matrix, respectively.

Thus, the approximated posterior means and variances of each element in can be represented by a function of and , denoted as , which is called the inference network. In detail,


where , , , are the approximated posterior means and variances (a vector consisting of the diagonal elements of the covariance matrix) of and respectively. With the reparametrization trick, we can express , and . Correspondingly, we express their prior means and variances as , , , , where and are two tuning parameters. Then the ELBO in Equation (6) can be explicitly expressed. The reconstruction term is


where is a constant and


The penalty term is

where is a constant.

We can draw several implications from the closed-form expression of the ELBO. Maximizing the ELBO in Equation (6) is equivalent to minimizing the sum of Equation (8) and Equation (10), which are balanced by and . Minimizing the reconstruction term forces the corrected KG/BG embeddings to behave similarly to the observed BG embeddings as per the selected edge function . It suggests that the reconstruction term preserves the topological structure of BG. Accordingly, minimizing Equation (10) forces the approximated posterior mean/variance to be close to the prior mean/variance. If the prior mean of is set to be , such minimization forces the corrected KG/BG embeddings to be close to the observed KG embeddings. It indicates that the penalty term preserves the topological structure of KG. Thus, the refined KG/BG embeddings can be regarded as a mixture of information. The two parameters and act as controllers of such mixing. For example, a small indicates that the corrected embeddings lean towards the observed KG embeddings rather than the observed BG embeddings, and vice versa.
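The reparametrization trick used in this section to express the latent variables can be sketched as follows: a sample is written as a deterministic function of the posterior mean and variance and of standard normal noise, so that gradients can flow to the parameters that produce the mean and variance. The function and argument names are illustrative, not the paper's notation.

```python
import numpy as np


def reparam_normal(mean, var, eps):
    """Sample from N(mean, diag(var)) given standard normal noise eps."""
    return mean + np.sqrt(var) * eps


def reparam_lognormal(log_mean, log_var, eps):
    """Sample from a log-normal whose logarithm is N(log_mean, diag(log_var))."""
    return np.exp(log_mean + np.sqrt(log_var) * eps)


rng = np.random.default_rng(0)
eps = rng.standard_normal(3)

mean, var = np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.2, 0.3])
normal_sample = reparam_normal(mean, var, eps)

# A log-normal draw is always positive, which suits a variance-like latent variable.
variance_sample = reparam_lognormal(np.zeros(3), 0.25 * np.ones(3), eps)
```

In a TensorFlow implementation the same two expressions would be applied to the outputs of the inference network, with the noise drawn fresh at every training step.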

4.4 Algorithm

Given all the components discussed above, we can write down the detailed algorithm of BEM. First, we sample two batches of samples of batch size , denoted as batch and ; then pair them up randomly, denoted as . For each batch, we impose the same prior information for all the samples in this batch, and estimate


where , is the number of bootstrap replicates, is the -th bootstrap estimator of from (), and . Then, for each pair of sample , use the inference network Equation (7) to get the approximated posterior information , , , , , , as shown in Figure 2 (b). Next, we sample standard normal variables to get and by Equation (9), where we set and , . We obtain the ELBO in Equation (6) via Equations (8)-(10), as shown in Figure 2 (c). Finally, we can use any optimization method, such as Adam [21], to update and when maximizing the ELBO. We run the above steps for times, and we can get the refined KG/BG embedding for by

1:Pre-trained KG/BG embeddings , ; tuning parameters , ; batch size , number of iterations .
2:for  do
3:     Sample two batches , of batch size , and pair them up as ;
4:     Estimate the prior information by Equation (11);
5:     for  do
6:         Get the posterior information by Equation (7);
7:         Sample a standard normal variable from and respectively. Get and via Equation (9).
8:     Obtain the ELBO in Equation (6) via Equations (8)-(10);
9:     Update and by maximizing the ELBO.
10:for  do
11:     Get the refined KG/BG embeddings and by Equation (12).
12:Denote the and in the last round as and .
13:, , .
Algorithm 1 The BEM method.
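Step 3 of Algorithm 1 (sampling two batches and pairing them up at random) can be sketched as below. Whether the batches are drawn with or without replacement, and whether the two batches may overlap, is not stated in the text, so the choices made here (without replacement within a batch, overlap allowed between batches) are assumptions, as is the function name.

```python
import numpy as np


def sample_paired_batches(num_entities, batch_size, rng):
    """Return an array of shape (batch_size, 2) holding random node-index pairs.

    Each column is one batch drawn without replacement; because Generator.choice
    returns the draws in random order, aligning the two columns row by row
    already yields a random pairing.
    """
    batch_a = rng.choice(num_entities, size=batch_size, replace=False)
    batch_b = rng.choice(num_entities, size=batch_size, replace=False)
    return np.stack([batch_a, batch_b], axis=1)


rng = np.random.default_rng(0)
pairs = sample_paired_batches(num_entities=100, batch_size=8, rng=rng)

# Each row (i, j) would then be fed to the inference network and the edge function.
first_nodes, second_nodes = pairs[:, 0], pairs[:, 1]
```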

To analyze the complexity of Algorithm 1, we simply use two-layer MLPs (multi-layer perceptrons) for and . Let be the number of hidden nodes of these neural networks. Then it is easy to see that the computational complexity is , where is the number of iterations for the maximization step (Line ) in Algorithm 1. If we set , the computational complexity is . Furthermore, the storage complexity is just , since the algorithm merely needs to keep track of two sets of parameters in and . Therefore, the algorithm is efficient in both time and storage in the sense that the size of the dataset only affects the computational time linearly. However, when the dataset is too large to be entirely loaded into main memory, the algorithm might suffer from a non-negligible overhead caused by partitioning and loading the data during the iterations.

4.5 Edge Function

The edge function in Equation (3) characterizes the interplay between nodes. The choice of this function determines what kind of KG information is incorporated into the BG embeddings. We give four examples as below:

  • A natural choice is the translation function i.e., , where . TransE and its variants are based on the translation operation, and aim to minimize the L1/L2 loss between the translated embedding of the head entity and the corresponding embedding of the tail entity [5, 42, 27].

  • An arbitrary similarity function can be used that measures the similarity between and , where . Such a choice coincides with the objective functions of the majority of BG/KG embedding methods [9, 7]. For instance, GraphSAGE [15], GCN [23], node2vec [13], etc., maximize the inner product between positive samples while minimizing this metric between negative samples.

  • If the edge function only relies on the index and , such as the edge attribute between node and node , BEM becomes a supervised model.

  • If the edge function is an identity function , then Model (3) is reduced to Model (4). Here, simply concatenates vectors and , thus .

In this article, we use the translation function . In fact, the translation function is equivalent to the similarity function using inner product or cosine similarity if the embeddings are normalized onto the unit sphere, such as embeddings generated by GraphSAGE, TransE and its variants. As shown in Figure 2 (d), the norm of the difference between two points on the sphere is bijectively mapped to the angle between the rays from the origin to the two points.
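The equivalence claimed for unit-norm embeddings follows from the identity ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u·v = 2 - 2 cos(θ), where θ is the angle between u and v. The short numerical check below uses random vectors as stand-ins for embeddings; it is an illustration of the identity, not part of the BEM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors, normalized onto the unit sphere.
u = rng.normal(size=16)
v = rng.normal(size=16)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

cos_sim = float(u @ v)                            # cosine similarity (= inner product here)
sq_distance = float(np.linalg.norm(u - v) ** 2)   # squared norm of the translation u - v

# On the unit sphere the two quantities determine each other.
identity_gap = abs(sq_distance - (2.0 - 2.0 * cos_sim))
```

Since 2 - 2 cos(θ) is strictly increasing in θ on [0, π], a smaller translation norm always means a larger cosine similarity, so the two edge functions rank node pairs identically on the sphere.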

5 Experiments

We empirically study and evaluate BEM on two small datasets and one large-scale dataset for a variety of tasks. Each dataset consists of one KG and one BG with pre-trained node embeddings. The goal of these experiments is to show that embeddings refined by BEM can outperform the original pre-trained embeddings on some tasks, while retaining their efficacy on most of the others:

  • The node classification task (on two small datasets) studies whether BEM can help refine the KG/BG embedding using the BG/KG embedding. It also investigates whether BEM can reveal useful information in KG and BG for the classification purpose (Section 5.1.2).

  • The link prediction [5] and the triplet classification [38] (on two small datasets) investigate whether BEM can extract useful information from BG to refine the KG embedding.

  • The item recommendation task (on the large dataset) studies whether the information in KG can enhance the performance of the BG embedding.

For the node classification task, we study the KG/BG and the concatenated embeddings that are refined by BEM. In contrast, we only consider the KG embedding for the link prediction task and the triplet classification task since the two tasks are designed for the KG embedding. For the same reason, we only consider the BG embedding for the item recommendation task.

We implement BEM as per Algorithm 1 based on TensorFlow (footnote 3: the code can be found at ). Throughout this section, we use the following default parameter setting:

  • Functions and are implemented as two-layer MLPs with hidden nodes and the ReLU [28] activation.

  • The batch size is , the optimization algorithm is Adam [21], the learning rate is , the number of training steps .

  • , .

A discussion on the selection of the above parameters is deferred to Appendix C.

5.1 Two Small Datasets

The two small datasets have the same KG but differ in the BGs. The shared KG is FB15K237, which is reduced from FB15K to remove the reversal relations [10]. There are entities, relations, and training triplets, validation triplets, testing triplets. The first dataset uses a pagelink network (denoted as pagelink) that records the links between the Wikipedia pages of entities in FB15K237. It includes nodes (a subset of the entities in FB15K237) and links. The second dataset comes with a short paragraph description (denoted as desc) for each entity in FB15K237. Strictly speaking, the descriptions do not form a BG due to the lack of connections between descriptions. We regard them as a graph of isolated nodes to evaluate BEM under extreme conditions where BG does not contain any interplay information between nodes. See Appendix A for more details on the two datasets.

FB15K237 + pagelink
                      node2vec                   LINE
KG method  BEM    KG     BG     concat     KG     BG     concat
TransE     O      85.59  75.12  89.39      85.59  77.57  89.44
           I      85.51  82.56  85.97      86.35  85.44  87.05
           P      88.89  86.32  90.29      88.21  86.27  90.01
TransD     O      86.06  75.12  89.18      86.06  77.57  89.00
           I      83.73  78.86  84.16      86.58  85.10  86.69
           P      88.60  85.39  89.90      88.70  85.30  89.73

FB15K237 + desc
                      doc2vec                    sentence2vec
KG method  BEM    KG     BG     concat     KG     BG     concat
TransE     O      85.32  75.62  87.92      85.32  83.42  88.43
           I      86.19  81.50  86.41      87.61  85.18  88.07
           P      87.68  81.52  87.86      88.05  85.82  88.57
TransD     O      85.83  75.62  88.07      85.83  83.42  88.52
           I      86.75  81.44  86.85      87.96  84.97  88.07
           P      87.34  82.24  88.15      88.36  86.12  88.86

Table 3: The node classification accuracy (%) using the refined BG/KG embeddings by BEM. Here KG, BG and concat refer to the KG embedding, BG embedding and the concatenation of the KG and BG embeddings, respectively. Rows O, I and P denote BEM-O (original embeddings), BEM-I and BEM-P.

We use TransE [5] and TransD [18] from OpenKE [16] to pre-train KG’s embeddings. Both of them are trained for epochs with dimension and other parameters are taken as default. For the BGs, we use doc2vec [25] and sentence2vec [32] to pre-train desc BG embeddings, and node2vec [13] and LINE [39] to pre-train pagelink BG embeddings respectively. The dimension of the BG embedding is set to be . More details on the experiment and hyper-parameter setups are included in Appendix B.

5.1.1 Node classification

In the node classification task, there are class labels. The embeddings are fed into a multi-label logistic regression model for training and prediction. Table 3 shows the results of BEM, from which we can draw three implications. First, we observe consistent improvements of BEM-P over BEM-O (the original embedding) in almost all settings (accuracies boosted by - for KG and BG). This indicates that we can benefit from integrating the information of the two sources. Second, if the classifier is sufficiently expressive, concat-O is expected to perform the best since there is no loss of information from the input. However, concat-P turns out to perform slightly better than concat-O in most cases. This suggests that BEM-P not only preserves the information needed for node classification, but also reveals additional signals. Third, as expected, BEM-P outperforms BEM-I since the former accounts for the pairwise interactions that are crucial for the embedding learning of KG/BG. Finally, we point out that the concatenated embedding and the KG/BG embedding are not directly comparable. The concatenated embedding is longer than the BEM-refined embedding, so the classifier for the former has more parameters and is thus more expressive. For a fair comparison, we study the projection of the concatenated embedding onto the BG/KG space; the corresponding results are deferred to Appendix D.

5.1.2 Empirical analysis

To understand the properties of the embeddings refined by BEM-P, we perform two empirical analyses on the FB15K237-pagelink dataset. First, we compute the absolute cosine similarity for each pair of nodes using KG-O, KG-P, BG-O and BG-P, respectively. From Figure 3, we observe that the similarities of KG-P and BG-P are distributed more extremely than those of KG-O and BG-O: the former have more highly correlated and more uncorrelated node pairs. This indicates that BEM-P pulls some nodes into tight groups while pushing others apart. The same conclusion can be drawn from the t-SNE visualization of the embeddings (Figure 4). Second, we use the class labels for the node classification task to compute

where , are two classes, and

This metric reflects the degree to which the topological structure of the embeddings aligns with the labels. We have , , and , indicating that BEM-P pulls nodes of the same class closer together while pushing nodes of different classes apart. This result suggests that BEM-P is able to preserve and further reveal the topological structure of both KG and BG.
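The first analysis, the distribution of absolute cosine similarities over randomly sampled node pairs, can be sketched as follows. This is a minimal illustration with the Python standard library; the function names, the toy embeddings and the sampling scheme are assumptions and not the authors' code. The class-level metric of the second analysis is not reproduced.

```python
import math
import random

def abs_cosine(u, v):
    # Absolute value of the cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return abs(dot) / (norm_u * norm_v)

def sample_pair_similarities(embeddings, n_pairs, seed=0):
    # Draw random pairs of distinct nodes and score each pair.
    rng = random.Random(seed)
    node_ids = list(embeddings)
    sims = []
    for _ in range(n_pairs):
        a, b = rng.sample(node_ids, 2)
        sims.append(abs_cosine(embeddings[a], embeddings[b]))
    return sims

# Toy embeddings (hypothetical): "a" and "b" are anti-parallel, "c" is orthogonal.
toy = {"a": [1.0, 0.0], "b": [-2.0, 0.0], "c": [0.0, 3.0]}
sims = sample_pair_similarities(toy, n_pairs=20)
```

A histogram of `sims` computed once for the original embeddings and once for the refined ones would give the kind of comparison shown in Figure 3.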

Figure 3: Distribution of the similarities between nodes. Here, 1,000,000 node pairs are sampled randomly.
Figure 4: Visualization of the embeddings. Blue: BEM-O; Red: BEM-P.

5.1.3 Link Prediction and Triplet Classification on the KG side

We evaluate BEM on the link prediction and triplet classification tasks. Since BEM refines only the entity embeddings, we retrain the relation embeddings for another epochs, using the BEM-refined KG embeddings and the original relation embeddings as the initial values. Table 4 shows that the KG embeddings also benefit from incorporating the BG information via the BEM refining. In contrast, the concat-O embeddings are much inferior. This confirms that concatenation does not fully expose the topological structure of the KG, whereas BEM makes good use of this information. Moreover, the improvement occurs mainly on the pagelink dataset. On the desc dataset, the TransD embeddings improve slightly while the TransE embeddings get worse after the BEM refining. This can be explained by the fact that the desc dataset does not provide supplementary interaction information to the KG.

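| Metric | Embedding | pagelink, TransE, node2vec | pagelink, TransE, LINE | pagelink, TransD, node2vec | pagelink, TransD, LINE | desc, TransE, doc2vec | desc, TransE, sentence2vec | desc, TransD, doc2vec | desc, TransD, sentence2vec |
|---|---|---|---|---|---|---|---|---|---|
| Hit@10 (%) in LP-Filtered | KG-O | 43.14 | 43.14 | 43.86 | 43.86 | 43.14 | 43.14 | 43.86 | 43.86 |
| Hit@10 (%) in LP-Filtered | KG-I | 42.25 | 43.00 | 44.31 | 44.56 | 41.86 | 42.05 | 42.31 | 44.58 |
| Hit@10 (%) in LP-Filtered | KG-P | 43.66 | 43.52 | 44.72 | 44.67 | 41.99 | 42.21 | 44.26 | 44.47 |
| Hit@10 (%) in LP-Filtered | concat-O | 36.99 | 37.47 | 38.32 | 38.45 | 40.17 | 40.07 | 40.79 | 37.83 |
| Accuracy (%) in TC | KG-O | 76.56 | 76.56 | 78.29 | 78.29 | 76.56 | 76.56 | 78.29 | 78.29 |
| Accuracy (%) in TC | KG-I | 76.70 | 76.86 | 78.54 | 78.80 | 76.06 | 76.42 | 78.63 | 78.61 |
| Accuracy (%) in TC | KG-P | 77.13 | 77.09 | 78.96 | 79.11 | 76.17 | 76.21 | 78.70 | 78.60 |
| Accuracy (%) in TC | concat-O | 71.97 | 73.23 | 71.82 | 70.75 | 71.41 | 71.32 | 72.15 | 69.91 |
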
Table 4: Results of Link prediction (LP) and Triplet classification (TC).
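For readers unfamiliar with the Hit@10 metric in the filtered setting, the following sketch shows one common way to compute it. It assumes the filtered protocol that is standard in the knowledge graph embedding literature, in which candidate tails that form other known true triplets are ignored when ranking the test tail. The function names and the toy scores are hypothetical, and lower scores are assumed to be better, as with a translation distance.

```python
def filtered_rank(scores, true_tail, known_tails):
    # Rank of the true tail among all candidates, skipping the other
    # candidates that are already known to be correct tails.
    target = scores[true_tail]
    rank = 1
    for entity, score in scores.items():
        if entity == true_tail or entity in known_tails:
            continue
        if score < target:
            rank += 1
    return rank

def hits_at_k(ranks, k=10):
    # Fraction of test triplets whose true entity is ranked within the top k.
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy candidate scores for one (head, relation) query; lower is better.
scores = {"e1": 0.1, "e2": 0.2, "e3": 0.3, "e4": 0.4}
raw = filtered_rank(scores, "e3", known_tails=set())        # e1 and e2 rank above e3
filtered = filtered_rank(scores, "e3", known_tails={"e1"})  # e1 is ignored
hit_rate = hits_at_k([filtered, 11, 1], k=10)
```

Filtering matters because a model should not be penalized for ranking another correct answer above the test answer.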

5.2 A Large-Scale Dataset

| #ent. | #scenario | #category | #rel. | #train |
|---|---|---|---|---|
| 17.37M | 182K | 8.96K | 5.18K | 60.65M |

| #item | #value | #user | #edge_click | #edge_purchase |
|---|---|---|---|---|
| 9.14M | 8.04M | 482M | 7,952M | 144M |

Table 5: Specifications for the large-scale dataset.

In this section, we apply BEM to the KG/BG embeddings generated from a large-scale Alibaba Taobao dataset (details are deferred to Appendix A), whose statistics are summarized in Table 5. For computational efficiency, TransE is used to obtain the KG embeddings on a knowledge database established by Alibaba Taobao. As for the BG embeddings, we run GraphSAGE on a graph constructed from users’ behaviors, e.g., two items are connected if a certain number of customers bought them simultaneously over the past months. GraphSAGE is a representative graph neural network (GNN) method and has achieved good performance on large datasets. The dimension of KG embedding and the dimension of the BG embedding are , following the online setting of Alibaba Taobao. We take the recommendation task for evaluation. Specifically, each customer has a set of trigger items from his/her historical behaviors, including clicks, purchases, add-to-preferences and add-to-carts. These trigger items are then used to retrieve (by FAISS [19]) more items based on the BG embeddings. We evaluate our method by counting the number of retrieved items that are actually bought/clicked by the user in the following days. Table 6 reports the hit recall rates of BG-P and BG-O on the recommendation task.

We check whether the retrieved items are of the same brand/category as those actually bought/clicked items in the following days. Combining these two granularities, we observe that the hit recall rates for BG-P are boosted by - compared to BG-O, which is quite significant considering there are over million items. It validates that BEM-P is able to incorporate useful KG information into the BG embedding for the item recommendation purpose.
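The retrieval-and-evaluation loop described above can be sketched as follows. This is an illustration under stated assumptions and not the production pipeline: a brute-force cosine search stands in for FAISS, the item names and embeddings are toy values, and the hit recall is assumed to be the fraction of a user's later interactions that appear among the retrieved items (the exact definition is not given in this section).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def retrieve(trigger_ids, item_emb, k):
    # Score every non-trigger item by its best similarity to any trigger,
    # then keep the k highest-scoring items (ties broken by item id).
    best = {}
    for trigger in trigger_ids:
        for item, vec in item_emb.items():
            if item in trigger_ids:
                continue
            score = cosine(item_emb[trigger], vec)
            if score > best.get(item, -2.0):
                best[item] = score
    ranked = sorted(best, key=lambda item: (-best[item], item))
    return ranked[:k]

def hit_recall(retrieved, future_items):
    # Assumed definition: share of later clicked/bought items that were retrieved.
    if not future_items:
        return 0.0
    return len(set(retrieved) & set(future_items)) / len(future_items)

# Toy item embeddings (hypothetical values).
item_emb = {
    "shoe": [1.0, 0.0],
    "sock": [0.9, 0.1],
    "boot": [0.8, 0.3],
    "lamp": [0.0, 1.0],
}
retrieved = retrieve({"shoe"}, item_emb, k=2)
recall = hit_recall(retrieved, {"sock", "lamp"})
```

Matching at the brand or category level, as in Table 6, would replace the item ids in `hit_recall` with the brand or category of each item.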

Finally, for each concept/scenario, we use TransE to predict its top item categories based on KG-O and KG-P (following the procedure in Section 5.1.3). As shown in Table 7, KG-P finds more related items for the given concepts. This indicates that by incorporating the BG information via BEM, we can acquire novel knowledge that does not exist in the original KG.

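| Granularity | Hit @ | click | | buy | |
|---|---|---|---|---|---|
| brand | 10 | 15.97 | 16.14 | 24.87 | 25.10 |
| brand | 30 | 16.65 | 17.12 | 25.70 | 26.57 |
| brand | 50 | 17.26 | 17.90 | 26.39 | 27.33 |
| category | 10 | 27.46 | 27.40 | 27.85 | 27.91 |
| category | 30 | 28.43 | 29.99 | 28.50 | 29.45 |
| category | 50 | 29.58 | 32.88 | 29.26 | 31.47 |
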
Table 6: Hit recall rates (%) for item recommendation based on customer-specific trigger items. The recommended items are retrieved by finding the closest items to the trigger items using the BG embeddings BG-P and BG-O.
| concept | predicted categories using KG-O | predicted categories using KG-P |
|---|---|---|
| neuter clothing | jacket, homewear | Quick-drying T-shirt, sports down jacket, toning pants, aerobics clothes, warm pants |
| sports training | None | Quick-drying T-shirt, sports down jacket, Yoga T-shirt, training shoes, aerobics clothes, sports bottle |
| household items | succulents, detergent, tissue box, kitchen knife, man’s facial cleanser, washing cup, yoga mat towel, health tea, scented candle | washing machine cover, spray, table, tape, fish tank cleaning equipment, pen container, digital piano, maker, wood sofa bath bucket, composite bed, mosquito patch, storage rack, storage box, pillow interior, leather sofa, needle, cotton swab, laundry ball, coffee cup, desiccant, trash bag, indoors shoes |

Table 7: Examples in which BEM acquires novel knowledge that does not exist in the KG.

6 Discussion

In this paper, we introduce BEM, a Bayesian framework that refines graph embeddings by integrating the information from the KG and BG sources. BEM has been evaluated in a variety of experiments. It is shown to improve the embeddings on multiple tasks by leveraging the information from the other side. Compared with the concatenation method (the baseline), BEM achieves superior or comparable performance with higher efficiency on the node classification task, and it can help in other tasks where simple aggregation methods (e.g., concatenation) are not applicable. It is designed by bridging KG and BG via a Bayesian generative model, where the former is regarded as the prior while the latter is the observation.

Currently, only one BG is considered at a time in this work. In fact, BEM can be easily extended to deal with multiple BGs. Integrating more than one BG may further refine the KG, as their behavior-specific biases can cancel each other out. Besides, for the time being, BEM works only with pre-trained KG/BG embeddings. It can potentially be extended so that the networks for the KG/BG embeddings are connected and jointly trained via this framework. In other words, BEM can act as an interface that connects any KG embedding method with any BG embedding method for end-to-end training. This makes the learning of the BG embedding supervised by the KG information. In turn, the learning of the KG embedding can be supplemented with instantiated samples in the BG.


Appendix A Dataset Details

Our experiments on public datasets mainly use FB15K237, the pagelink network, the entity descriptions and the entity labels. Their sources are discussed below.

A.1 Small Datasets

The two small datasets share the same KG but differ in their BGs. Their relations are depicted in Figure 5.

Figure 5: Illustration of KG and BG for the two small datasets.

Knowledge Graph We use FB15k-237, a subset of Freebase, as the knowledge graph, which is also used in ConvE [10]. Unlike the popular FB15k dataset used in many studies of knowledge graph representation, it does not include the inverse relations that may cause leakage from the training set to the validation set. FB15k-237 has 14,541 entities, 237 relations, 272,115 training triples, 20,466 test triples and 17,535 validation triples.

Pagelink Network The pagelink network is a directed graph that we generated ourselves. Since FB15k is a subset of Freebase, we first map the entities of FB15k to Wikidata, a knowledge database that supports Wikipedia and Wikimedia Commons, according to the mapping data from the Freebase database [1]. Then we use the pagelinks in English Wikipedia to build the pagelink network. Since we could not get all the data, the pagelink network has fewer entities than the knowledge graph. The pagelink network has 14,071 vertices and 1,065,412 edges in total.

Descriptions of Entities The descriptions used in our experiments are the same as DKRL [44]. It has 14,904 English descriptions of entities.

Labels of Entities In Wikidata, the property ’instance of’ is an isA relation that represents the class to which an entity belongs. Therefore, we use the values of the ’instance of’ property as the entity labels in the node classification task. We also consider the problem of information leakage. In Freebase, the relation ’type/object/type’ represents the type of an entity. To prevent this relation from leaking information into the evaluation tasks, we check that ’type/object/type’ is not used in the triples of the training set.

A.2 Large Dataset

Knowledge Graph of Alibaba Taobao The knowledge graph of Alibaba Taobao items shows a tree structure. It contains four types of entities: items, categories items belong to, scenes of the categories, and the attribute values of the items. Therefore, there are three types of triples:

  • ,

  • ,

  • .

Among the above three types of triplets, the first one is N-N mapping, the second one is 1-N mapping and the third one is N-N mapping.

Behavior Graph of Alibaba Taobao The behavior graph of Alibaba Taobao is a bipartite graph that contains both user and item nodes. Interactions between users and items are either CLICK or BUY, sampled from a sliding window of 2 weeks (Dec. 27th, 2018 - Jan. 10th, 2019). The data of the first week were used for training. We used the trained model to recommend items for users with trigger items collected on Jan. 5th, 2019, and checked whether these recommended items were actually clicked/bought in the following week.

Each user has specific features describing certain properties, e.g., age, gender, occupation, preference towards some category of items, and the recently clicked items. Each item has features such as price, category and brand. Edges (interactions) have weights that decay with time. When learning the node embeddings of the behavior graph, we use the edges between users and items as positive samples and randomly corrupted edges as negative samples. Node features are incorporated along with edges in the training phase.
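The sampling scheme in the paragraph above can be sketched as follows. The exponential form of the time decay, the half-life value and the choice to corrupt only the item endpoint of an edge are assumptions made for illustration; the source states only that edge weights decay with time and that negatives are randomly corrupted edges.

```python
import random

def decayed_weight(age_days, half_life_days=7.0):
    # Hypothetical exponential decay: the weight halves every half_life_days.
    return 0.5 ** (age_days / half_life_days)

def corrupt_edges(edges, items, n_neg, seed=0):
    # For each observed (user, item) edge, draw up to n_neg items that the
    # user did not interact with and use them as negative samples.
    rng = random.Random(seed)
    observed = set(edges)
    negatives = []
    for user, _ in edges:
        made, attempts = 0, 0
        while made < n_neg and attempts < 100 * n_neg:
            attempts += 1
            candidate = rng.choice(items)
            if (user, candidate) not in observed:
                negatives.append((user, candidate))
                made += 1
    return negatives

# Toy interaction data (hypothetical).
edges = [("u1", "i1"), ("u2", "i2")]
items = ["i1", "i2", "i3", "i4"]
negatives = corrupt_edges(edges, items, n_neg=2)
```

The attempt limit keeps the loop from running forever for a user who has interacted with almost every item.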

Appendix B Further details on functions

To get embeddings of different data sets, we use several functions. The details of them are shown below.

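TransE TransE is a typical knowledge graph representation method [5]. It treats relations in the knowledge graph as translation operators from head entities to tail entities, which is represented as

$$\mathbf{h} + \mathbf{r} \approx \mathbf{t},$$

where $\mathbf{h}$, $\mathbf{r}$ and $\mathbf{t}$ denote the embeddings of the head entity, the relation and the tail entity of a triplet.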

In this work, we use the TransE API offered by [16] to get embeddings of entities in knowledge graph.
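As a small illustration of the translation view of TransE, the sketch below scores a triplet by the distance between the translated head and the tail. It is a standalone toy example and does not use the OpenKE API; the function name and the vectors are hypothetical.

```python
import math

def transe_score(head, relation, tail, norm=1):
    # Distance between (head + relation) and tail; smaller means more plausible.
    diff = [h + r - t for h, r, t in zip(head, relation, tail)]
    if norm == 1:
        return sum(abs(d) for d in diff)
    return math.sqrt(sum(d * d for d in diff))

# Toy embeddings: good_tail equals head + relation exactly, bad_tail does not.
head = [1.0, 0.0]
relation = [0.0, 1.0]
good_tail = [1.0, 1.0]
bad_tail = [0.0, 0.0]

good = transe_score(head, relation, good_tail)
bad_l1 = transe_score(head, relation, bad_tail, norm=1)
bad_l2 = transe_score(head, relation, bad_tail, norm=2)
```

Training pushes the score of observed triplets below the score of corrupted ones, typically with a margin-based ranking loss.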

node2vec Node2vec is a network representation framework [13]. It uses a biased random walk procedure to preserve the neighborhood information of the network in node representation. We believe the neighborhood information in the pagelink network can help characterize an entity, so we use it to generate vertex embeddings of pagelink network. In our experiment, we set the parameters as follows: the length of walk is 80, the number of walks is 10, the context size is 10.
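The biased random walk at the core of node2vec can be sketched as follows. The walk length (80) and the number of walks per node (10) follow the setup above, while the return parameter `p`, the in-out parameter `q` (both set to 1.0 as placeholders), the toy graph and the function names are assumptions. The toy graph is undirected for simplicity, whereas the pagelink network is directed.

```python
import random

def biased_walk(adj, start, walk_length, p, q, rng):
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = adj.get(current, [])
        if not neighbors:
            break  # dead end
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)   # step back to the previous node
            elif candidate in adj.get(previous, []):
                weights.append(1.0)       # stay near the previous node
            else:
                weights.append(1.0 / q)   # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

def generate_walks(adj, num_walks=10, walk_length=80, p=1.0, q=1.0, seed=0):
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(biased_walk(adj, node, walk_length, p, q, rng))
    return walks

# Toy undirected graph as adjacency lists (hypothetical).
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walks = generate_walks(adj)
```

The resulting walks are then fed to a skip-gram model with the chosen context size to obtain the node embeddings.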

LINE LINE is a network representation method [39]. It preserves the first-order and second-order proximities in a network. In this work, we use the LINE API offered by OpenKE to get entity embeddings in the pagelink network. In our experiment, we set the negative ratio to 5 and use both the first-order and the second-order proximities of the graph.

doc2vec Doc2vec is an unsupervised framework to get embeddings of given sentences or paragraphs [25]. Document embeddings are trained to predict words from their context in the documents. We use it to get entity embeddings based on entity descriptions. In our experiment, we use PV-DM (Distributed Memory Model of Paragraph Vectors) to get the embeddings of documents.

sentence2vec Sentence2vec is an unsupervised, C-BOW-inspired framework to get embeddings of sentences or paragraphs [32]. It has been reported to achieve state-of-the-art performance on sentence similarity tasks. Therefore, we use it to generate entity embeddings based on descriptions, for the purpose of reconstructing the graph based on vertex similarity. In our experiment, we set the parameters as follows: the learning rate is 0.2, the update rate of the learning rate is 100, the number of epochs is 5, the minimal number of word occurrences is 5, the minimal number of label occurrences is 0, and the maximal length of word n-grams is 2.

GraphSAGE GraphSAGE is an inductive representation learning framework. Unlike transductive graph embedding frameworks that only generate embeddings for seen nodes, GraphSAGE leverages node attribute information to learn node embeddings in a generalized way and thus is capable of generating representations on unseen data. We use GraphSAGE to learn node embeddings on Alibaba Taobao’s Behavior Graph.

Appendix C The selection of parameters for Algorithm 1

To understand how the tuning parameters influence the performance of BEM-P, we apply Algorithm 1 to pre-trained FB15K237 embeddings (KG) obtained by TransE and pre-trained pagelink embeddings obtained by node2vec. Each time we change only one parameter relative to the default setup mentioned in Section 5, i.e., , , , , . The corresponding results of link prediction and triplet classification are displayed in Table 8. We can draw a few conclusions from these results:

  • , and the number of training steps affect the BEM-P marginally. It indicates that BEM-P does not require high model complexity for expressiveness and converges quickly.

  • As with other gradient-based algorithms, the learning rate is worth tuning.

  • The most important parameters are and . As explained in the last paragraph of Section 4.3, they balance the reconstruction term and the penalty term in Equation (8) and Equation (LABEL:eq:penalty-indpt-pair). Tuning and based on a validation set might give a significant boost in performance. But if the user wants to skip tuning, and can be a good starting point.

For the node classification task, we get similar results using the same dataset.

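| Parameter | Hit@10 (%) in LP-Filtered | | | | Accuracy (%) in TC | | | |
|---|---|---|---|---|---|---|---|---|
| | 200 | 500 | 800 | | 200 | 500 | 800 | |
| evaluation result | 43.38 | 43.66 | 43.41 | | 77.25 | 77.13 | 77.19 | |
| | 100 | 500 | 1000 | | 100 | 500 | 1000 | |
| evaluation result | 43.77 | 43.66 | 42.63 | | 77.40 | 77.13 | 77.28 | |
| learning rate | 0.0001 | 0.001 | 0.005 | 0.01 | 0.0001 | 0.001 | 0.005 | 0.01 |
| evaluation result | 42.53 | 43.66 | 44.95 | 45.10 | 76.59 | 77.13 | 77.89 | 77.74 |
| | 10 | 20 | 50 | 100 | 10 | 20 | 50 | 100 |
| evaluation result | 43.77 | 43.66 | 44.03 | 43.83 | 76.55 | 77.13 | 77.39 | 76.97 |
| | 0.01 | 0.1 | 1 | 5 | 0.01 | 0.1 | 1 | 5 |
| evaluation result | 44.91 | 45.82 | 43.66 | 31.90 | 78.21 | 79.33 | 77.13 | 71.08 |
| | 0.01 | 0.1 | 1 | 5 | 0.01 | 0.1 | 1 | 5 |
| evaluation result | 41.59 | 41.74 | 43.66 | 44.35 | 76.11 | 76.61 | 77.13 | 77.41 |
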
Table 8: The results of link prediction and triplet classification for the TransE method on the FB15K237 dataset and the node2vec method on the pagelink dataset, with varying tuning parameters. The default parameters are , , , , . Each row in the table only changes one parameter while keeping the others the same as default.

Appendix D More results of the node classification task on the FB15K237 dataset with two associative BGs

It is unfair to compare the BEM-refined embedding to the concatenated embedding directly, since the latter is longer than the former. In our case, the length of the concatenated embedding is () times longer than that of the KG (BG) embedding. Thus, the classifier for the concatenation has () more parameters than that of the KG (BG) embedding. To get a fair comparison, we project the concatenated embedding into () using a random Gaussian projection matrix, which can nearly preserve the distances between nodes.
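The projection step can be sketched as follows. Scaling the Gaussian entries by one over the square root of the output dimension keeps squared distances unchanged in expectation, which is the sense in which such a projection nearly preserves distances. The dimensions, the random vectors and the function names are hypothetical and are not the values used in the paper.

```python
import math
import random

def gaussian_projection_matrix(d_in, d_out, seed=0):
    # Entries are N(0, 1) scaled by 1/sqrt(d_out).
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)]
            for _ in range(d_out)]

def project(vec, matrix):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical dimensions: a 200-d concatenation projected down to 100-d.
D_IN, D_OUT = 200, 100
rng = random.Random(1)
u = [rng.gauss(0.0, 1.0) for _ in range(D_IN)]
v = [rng.gauss(0.0, 1.0) for _ in range(D_IN)]

R = gaussian_projection_matrix(D_IN, D_OUT)
ratio = euclidean(project(u, R), project(v, R)) / euclidean(u, v)
```

Here `ratio` compares the distance after projection with the distance before it. Repeating the projection with different seeds gives the spread across random projections.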

Tables 9 and 10 report the node classification results of BEM and its variations. Four implications can be drawn by looking at the tables in different ways. First, we observe consistent improvements of BEM-P over BEM-O in all settings. The classification accuracies on the BG (KG) embedding are boosted by about - with BEM-P. As for the concatenated version, the concat-O vector is expected to work better than the BEM embeddings if the classifier is expressive enough, since information might be lost during the BEM integration. However, it turns out that concat-P outperforms concat-O in almost all settings. This indicates that BEM-P does not lose information related to the classification task and is able to bring the embeddings into a better shape for it. Second, for a fair comparison in terms of the dimension, we use Gaussian random projections (repeated for times) to project the concatenated embedding to and , respectively. KG-P is superior to the projections of concat-O (for both and ), and is even comparable to concat-O. From the perspective of dimension reduction, this result suggests that BEM-P can preserve the majority of the information for KG. On the other hand, considering the goal of preserving the topological structure, BEM-P is unlikely to boost the performance of a low-quality BG-O to the level of concat-O. Third, we note that the projections of concat-P lose only marginal power during the dimension reduction and are more robust than the projections of concat-O. This indicates that the BEM-P representation is less noisy than the original embeddings. Finally, as expected, BEM-P outperforms BEM-I because the former accounts for the pairwise interactions. Such information is crucial for the learning of both the KG and BG embeddings.

BG embedding: node2vec

| KG method | BEM | KG | BG | concat (projection) | concat (projection) | concat |
|---|---|---|---|---|---|---|
| TransE | O | 85.59 | 75.12 | 82.86 (0.93) | 87.58 (0.27) | 89.39 |
| TransE | I | 85.51 | 82.56 | 83.96 (0.35) | 85.39 (0.16) | 85.97 |
| TransE | P | 88.89 | 86.32 | 88.71 (0.20) | 89.27 (0.16) | 90.29 |
| TransD | O | 86.06 | 75.12 | 82.37 (0.82) | 86.24 (0.26) | 89.18 |
| TransD | I | 83.73 | 78.86 | 81.67 (0.80) | 83.83 (0.28) | 84.16 |
| TransD | P | 88.60 | 85.39 | 87.83 (0.26) | 88.92 (0.27) | 89.90 |

BG embedding: LINE

| KG method | BEM | KG | BG | concat (projection) | concat (projection) | concat |
|---|---|---|---|---|---|---|
| TransE | O | 85.59 | 77.57 | 79.49 (1.14) | 86.54 (0.40) | 89.44 |
| TransE | I | 86.35 | 85.44 | 85.65 (0.34) | 86.63 (0.14) | 87.05 |
| TransE | P | 88.21 | 86.27 | 88.21 (0.48) | 89.12 (0.15) | 90.01 |
| TransD | O | 86.06 | 77.57 | 77.51 (1.16) | 85.80 (1.03) | 89.00 |
| TransD | I | 86.58 | 85.10 | 85.36 (0.35) | 86.40 (0.18) | 86.69 |
| TransD | P | 88.70 | 85.30 | 87.95 (0.44) | 88.82 (0.15) | 89.73 |

Table 9: The node classification accuracy (%) on the FB15K237-pagelink dataset, using the BG/KG embeddings refined by BEM. Here concat/concat refers to the projection of concat into /, and concat refers to the concatenated embedding itself. The numbers in the brackets of concat/ are the standard errors across random projections.

BG embedding: doc2vec

| KG method | BEM | KG | BG | concat (projection) | concat (projection) | concat |
|---|---|---|---|---|---|---|
| TransE | O | 85.32 | 75.62 | 83.87 (0.63) | 86.77 (0.23) | 87.92 |
| TransE | I | 86.19 | 81.50 | 85.32 (0.48) | 86.10 (0.08) | 86.41 |
| TransE | P | 87.68 | 81.52 | 86.40 (0.21) | 87.78 (0.21) | 87.86 |
| TransD | O | 85.83 | 75.62 | 83.19 (0.62) | 86.57 (0.35) | 88.07 |
| TransD | I | 86.75 | 81.44 | 85.48 (0.73) | 86.52 (0.13) | 86.85 |
| TransD | P | 87.34 | 82.24 | 86.31 (0.43) | 87.57 (0.19) | 88.15 |

BG embedding: sentence2vec

| KG method | BEM | KG | BG | concat (projection) | concat (projection) | concat |
|---|---|---|---|---|---|---|
| TransE | O | 85.32 | 83.42 | 86.67 (0.39) | 87.58 (0.24) | 88.43 |
| TransE | I | 87.61 | 85.18 | 86.95 (0.31) | 87.70 (0.14) | 88.07 |
| TransE | P | 88.05 | 85.82 | 87.61 (0.16) | 88.36 (0.24) | 88.57 |
| TransD | O | 85.83 | 83.42 | 85.69 (0.32) | 87.83 (0.16) | 88.52 |
| TransD | I | 87.96 | 84.97 | 86.89 (0.4) | 87.91 (0.24) | 88.07 |
| TransD | P | 88.36 | 86.12 | 87.40 (0.20) | 89.59 (0.18) | 88.86 |

Table 10: The node classification accuracy (%) on the FB15K237-desc dataset, using the BG/KG embeddings refined by BEM. Here concat/concat refers to the projection of concat into /, and concat refers to the concatenated embedding itself. The numbers in the brackets of concat/ are the standard errors across random projections.


  • [1] Data dumps — freebase api. December 20, 2018.
  • [2] P. Battaglia, R. Pascanu, M. Lai, and D. J. Rezende. Interaction networks for learning about objects, relations and physics. In Neural Information Processing Systems, pages 4502–4510, 2016.
  • [3] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pages 585–591, 2002.
  • [4] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • [5] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.
  • [6] Kenneth P Burnham and David R Anderson. Kullback-leibler information as a basis for strong inference in ecological studies. Wildlife research, 28(2):111–119, 2001.
  • [7] Hongyun Cai, Vincent W Zheng, and Kevin Chang. A comprehensive survey of graph embedding: problems, techniques and applications. IEEE Transactions on Knowledge and Data Engineering, 2018.
  • [8] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 891–900. ACM, 2015.
  • [9] Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering, 2018.
  • [10] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [11] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur. Protein interface prediction using graph convolutional networks. In Neural Information Processing Systems, pages 6530–6539, 2017.
  • [12] Walter R Gilks, Sylvia Richardson, and David Spiegelhalter. Markov chain Monte Carlo in practice. Chapman and Hall/CRC, 1995.
  • [13] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
  • [14] T. Hamaguchi, H. Oiwa, M. Shimbo, and Y. Matsumoto. Knowledge transfer for out-of-knowledge-base entities : A graph neural network approach. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 1802–1808, 2017.
  • [15] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1025–1035, 2017.
  • [16] Xu Han, Shulin Cao, Lv Xin, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. Openke: An open toolkit for knowledge embedding. In Proceedings of EMNLP, 2018.
  • [17] Matthew D Hoffman and Matthew J Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
  • [18] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 687–696, 2015.
  • [19] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.
  • [20] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems 30, pages 6348–6358. 2017.
  • [21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [22] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [23] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • [24] Marius Kloft and Gilles Blanchard. The local rademacher complexity of lp-norm multiple kernel learning. In Advances in Neural Information Processing Systems, pages 2438–2446, 2011.
  • [25] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.
  • [26] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 2181–2187, 2015.
  • [27] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, volume 15, pages 2181–2187, 2015.
  • [28] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  • [29] Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
  • [30] Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, and Mark Johnson. Stranse: a novel embedding model of entities and relationships in knowledge bases. In HLT-NAACL, pages 460–466. The Association for Computational Linguistics, 2016.
  • [31] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric transitivity preserving graph embedding. In KDD, 2016.
  • [32] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. In NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics, 2018.
  • [33] John Paisley, David Blei, and Michael Jordan. Variational bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.
  • [34] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
  • [35] Meng Qu, Jian Tang, Jingbo Shang, Xiang Ren, Ming Zhang, and Jiawei Han. An attention-based collaboration framework for multi-view network representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1767–1776. ACM, 2017.
  • [36] A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller, R. Hadsell, and P. Battaglia. Graph networks as learnable physics engines for inference and control. In arXiv preprint, page 1806.01242, 2018.
  • [37] Yu Shi, Fangqiu Han, Xinwei He, Xinran He, Carl Yang, Jie Luo, and Jiawei Han. mvn2vec: Preservation and collaboration in multi-view network embedding. arXiv preprint arXiv:1801.06597, 2018.
  • [38] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pages 926–934, 2013.
  • [39] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
  • [40] Lei Tang and Huan Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 817–826. ACM, 2009.
  • [41] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 1(2), 2017.
  • [42] Z Wang, J Zhang, J Feng, and Z Chen. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), pages 985–991, 2014.
  • [43] Jiawei Wu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Knowledge representation via joint learning of sequential text and knowledge graphs. arXiv preprint arXiv:1609.07075, 2016.
  • [44] Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. Representation learning of knowledge graphs with entity descriptions. In AAAI, pages 2659–2665, 2016.
  • [45] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Chang. Network representation learning with rich text information. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • [46] Cheng Yang, Maosong Sun, Zhiyuan Liu, and Cunchao Tu. Fast network embedding enhancement via high order proximity approximation. In IJCAI, pages 3894–3900, 2017.
  • [47] Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, and R Bharat Rao. Bayesian co-training. Journal of Machine Learning Research, 12(Sep):2649–2680, 2011.
  • [48] Deming Zhai, Hong Chang, Shiguang Shan, Xilin Chen, and Wen Gao. Multiview metric learning with global consistency and local smoothness. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):53, 2012.