1 Introduction
Graphs widely exist in the real world, including social networks [15, 23], physical systems [2, 36], protein-protein interaction networks [11], knowledge graphs [14] and many other areas [20]. There may be different views of the same set of nodes, and thus graphs of different architectures are built. For example, in the e-commerce industry, item-item networks can be constructed based on the user behaviors of clicks, purchases, add-to-preferences and add-to-carts respectively: two items are linked if they are clicked (or connected via other operations) by the same user. A corresponding knowledge graph can be crafted to represent a collection of interlinked descriptions of the items, e.g., color, materials, functions. Throughout this article, we refer to the graph with respect to a certain behavior/context as the behavior graph (BG), in order to distinguish it from the knowledge graph (KG) that consists of structured symbolic knowledge (triplets). (The concept of BG covers a wide range of conventional graphs and networks: page-link networks (the link behavior), author-citation networks (the citation behavior), item-item interaction networks (the co-click and co-purchase behaviors, etc.), to name a few.) KG and BG both reflect the interactions between entities/nodes in reality, but they differ in two aspects: 1) the graph structures; 2) the contained information; see Section 2 for a detailed discussion. The connection and the distinction between KG and BG imply that they can be complementary to each other. It is of great interest to integrate these two types of graphs in a unified way.
Benefits of the integration of KG and BG. In the sequel, we give three perspectives with examples from the e-commerce industry to illustrate the benefits of incorporating KG and BG. First, a KG-aided BG can achieve accurate recommendations. For instance, given a formal dress and high-heel shoes, methods based on BG alone may recommend arbitrary lipsticks. With information from the KG, it can make a better recommendation of formal lipsticks instead of sweet lipsticks, as KG has the knowledge that the dress and the shoes are associated with formal occasions. Second, a KG-aided BG can do more than BG alone. Suppose a user buys a ticket to Alaska in January; the knowledge "enjoying aurora in Alaska in winter" is triggered in KG. So it can recommend down jackets, outdoor shoes and tripods for viewing the aurora in a freezing environment. But methods using BG-only embeddings can hardly connect the flight ticket to such outfits. Third, novel knowledge can be discovered from BG on top of the known. For example, recent clothing fashions can be inferred from the frequently co-clicked or co-purchased clothes. Then humans' common sense or other experts' knowledge can be used to identify the most likely fashion trends of this year.
Motivation.
To deal with multiple graphs, a standard practice is to embed the nodes as vectors while simultaneously integrating the information from all the sources
[47, 24, 48]. To the best of our knowledge, however, there is no existing method that jointly learns the BG embedding and the KG embedding. As an alternative solution, it is common to take the pretrained KG/BG embeddings as the input to learn the representation of BG/KG [43, 44, 15]. Or one can simply learn the embeddings of KG and BG separately, then incorporate them via an aggregation method, e.g., concatenation or linear combination. For the first strategy, the interaction information contained in the KG/BG embedding can be distorted if it does not agree with that of BG/KG. For the second strategy, the topological structure from either side is either disguised (e.g., concatenating a short embedding with a long embedding) or destroyed (e.g., taking the average of two embeddings of the same length). In this article, we work with the pretrained BG and KG embeddings as this strategy is widely applicable. Our goal is to integrate BG and KG without losing the topological information from both sides.

Contribution. Throughout this paper, we consider only one KG and one BG. We develop a Bayesian framework called BEM (Bayes EMbedding) that refines the KG and BG embeddings in an integrated fashion while preserving and revealing the topological information from the two sources. The key idea behind BEM is that the KG embedding, plus a behavior-specific bias correction term, acts as the prior information for the generation of the BG embedding; see Figure 1 (c). BEM aims to maximize the likelihood under this Bayesian generative model. Our contribution is twofold. From the perspective of modelling, BEM is proposed to bridge KG and BG seamlessly, with consideration of their respective topological structures. As a framework, BEM is general and flexible in that it can take any pretrained KG embeddings and any BG embeddings and mutually refine them.
The rest of the paper is organized as follows. In Section 2, we discuss the difference and connection between KG and BG. In Section 3, we review works related to our method. In Section 4, we present our method, BEM. We then demonstrate the utility of BEM in three application studies involving two small datasets related to Freebase and a large dataset in e-commerce (Section 5), testing the BEM-refined embeddings on a variety of downstream tasks. Finally, we conclude with a discussion of the BEM framework and highlight promising directions for future work in Section 6.
2 Discussion of KG and BG
Here we discuss the difference and the connection between KG and BG to illustrate three points: 1) KG and BG are different and hard to jointly learn; 2) KG and BG contain complementary (distinct but related) information, and therefore it is promising to get better embeddings by integrating the two types of graphs; 3) KG and BG can be unified from two reasonable perspectives.
Difference between KG and BG. There are mainly two differences between KG and BG. First, KG encodes entities and their relations in the form of triplets $(h, r, t)$, where $h$, $t$ and $r$ are the head entity, the tail entity and their relation. It corresponds to a directed and highly heterogeneous network. In comparison, BG is constructed based on the interplay between the nodes under certain task/behavior-specific contexts. It corresponds to an undirected network with a limited number of edge types (homogeneous, or less heterogeneous than KG). Figure 1 (a) shows the difference between KG and BG in terms of the network structure. The distinction in structure makes it difficult to put the two graphs in a single framework for embedding learning. Second, the triplets in KG are extracted from authentic knowledge and experience. Thus, KG is a semantic network reflecting relatively objective facts that can stand the test of time. As for BG, it embodies time-varying and behavior-biased links between nodes, which we illustrate with two examples: 1) people may buy sunglasses and swimwear at the same time in summer, but they will barely purchase these two items together in winter; 2) two sorts of sunglasses can be viewed (the click behavior) for comparison, but they are rarely bought (the purchase behavior) together. The difference in information between KG and BG indicates that they can complement each other.
Connection between KG and BG.
Despite the distinction, KG and BG are also closely related, resembling the connection between humans’ knowledge and experience. KG can be regarded as an abstracted graph that reflects the shared properties among multiple BGs. This bottomup idea (from BGs to KG) implies that it is possible to acquire novel knowledge from all kinds of BGs. On the contrary, we can heuristically interpret the connection from top down, as shown in Figure
1 (c). KG contains the general information of items, e.g., the item properties (color, materials, etc.), the category of the item, and the concepts/scenarios of the category. (Scenarios are manually crafted to include items that appear together frequently under certain conditions. For example, the sunglasses and the swimwear both belong to the scenario "summer-beach".) Then, a node of BG can be thought of as being generated by adjusting the associated entity in KG with a behavior-specific correction term. For instance, the cellphone is conceptually a portable electronic device (KG). It exhibits a variety of properties under different scenarios (BG), e.g., a communication tool when connecting to others, an entertainment platform when playing games, a working/studying tool when looking up information online. The top-down idea indicates that we can use KG information to help the learning of BG.

3 Related Work
In this section, we review work related to our method. To the best of our knowledge, there is no existing method that learns the BG embeddings and the KG embeddings jointly. We first introduce multi-view learning, which is closest to this goal. Then we review alternative methods, followed by classic representation learning methods for conventional graphs and knowledge graphs.
3.1 Multi-view Embedding Learning
In real life, entities may have different feature subsets, which are called multi-view data. For instance, in e-commerce, an item may be associated with different behavior data in different scenarios, such as the data of purchases, clicks, add-to-preferences and add-to-carts. These multi-view data can be learned to get a uniform representation for one item. For this purpose, a variety of approaches have been proposed, including co-training, multiple kernel learning, and subspace learning [47, 24, 48]. In particular, many efforts have been made in multi-view network representation learning. Qu et al. [35] combine the embeddings of different network views linearly. Shi et al. [37] propose two characteristics (preservation and collaboration), and get node vectors by simultaneously modeling them. The latter is closely related to our work in the sense that it emphasizes the integration of different sources while preserving their own specialties. However, it only deals with homogeneous networks, as do other multi-view embedding learning methods. In contrast, our method is designed to combine BG with KG, which differ in both the data structures and the contained information.
3.2 Alternative Ways to Integrate KG and BG
There are alternative approaches to integrate KG and BG. First, the standard practice is to embed one graph into vectors, then take the embeddings as the input to the learning of the other graph. For example, Wu et al. [43] embed sequential texts, then take them as node/entity attributes for knowledge graph learning. Xie et al. [44] learn knowledge graph embeddings by using the embeddings of entity descriptions. Hamilton et al. [15] can take pretrained KG embeddings as input to learn BG embeddings as well. However, this line of work tends to focus on the targeted graph (the graph that uses the pretrained embedding for learning), while the topological structures from the other graph (the graph that generates the pretrained embedding) may be missing. Even though interaction information between nodes is contained in the pretrained embeddings, it can be weakened or ignored if it does not agree with the topology of the targeted graph. Second, there is an even simpler strategy to integrate KG and BG, i.e., learning the embeddings of KG and BG separately, then incorporating them via an aggregation method, e.g., concatenation or linear combination [15]. Nonetheless, the topological structures from both sides are disguised or destroyed by these aggregation methods. Our work falls into the second category and is designed to solve the above issue: it preserves and reveals the topological information when integrating BG and KG.
3.3 Representation Learning for BG and KG
Here we review methods used to pretrain the BG and KG embeddings. One line of work performs graph embedding based on the graph spectrum [3, 40]. Some works use matrix factorization to get node embeddings [45, 8, 46]. Additionally, simple neural networks are used to generate embeddings by making the distribution of the node embeddings close to that obtained from the topological structure [34, 39]. Recently, graph neural network based techniques have also been proposed and widely applied [23, 31, 15, 41].

Since KG differs from BG due to the semantic links between entities, the above embedding methods are not applicable to KG. Many efforts have been made to embed the nodes in KG. As a seminal work, TransE [5] learns a low-dimensional vector for every entity and relation in KGs. Later extensions include TransH [42], TransR [26] and STransE [30] for greater flexibility.
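TransE's translation principle, that $h + r \approx t$ holds for a true triplet, can be sketched in a few lines of numpy; the scoring function below is the standard TransE formulation, while the example vectors and the corrupted tail are illustrative.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility: a smaller ||h + r - t|| means a more likely triplet."""
    return np.linalg.norm(h + r - t)

rng = np.random.default_rng(2)
h = rng.normal(size=16)                           # head entity embedding
r = rng.normal(size=16)                           # relation embedding
t_true = h + r + rng.normal(scale=0.01, size=16)  # near-perfect translation
t_fake = rng.normal(size=16)                      # corrupted (random) tail

print(transe_score(h, r, t_true) < transe_score(h, r, t_fake))  # True
```

Training minimizes this score for observed triplets while pushing it up for corrupted ones, typically with a margin-based ranking loss.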
4 Methods
4.1 Notation
We denote $\mathbf{k}_v$ and $\mathbf{b}_v$ as the KG embedding and the BG embedding of an entity $v$, with dimensions $d_K$ and $d_B$ respectively. For a vector $x$, let $\dim(x)$ be the dimension of $x$, and let $x_j$ be its $j$th entry. We use $\odot$ for element-wise multiplication, i.e., for two vectors $x$ and $y$ of length $d$, $(x \odot y)_j = x_j y_j$ for $j = 1, \ldots, d$. Denote $\mathrm{KL}(p \,\|\, q)$ as the Kullback-Leibler (KL) divergence between distributions $p$ and $q$ [6]. Other notations used throughout this section are summarized in Table 1.

Notation  Meaning
$v$  Entity.
$\mathbf{k}_v$  KG embedding.
$\mathbf{b}_v$  BG embedding.
$d_K$ / $d_B$  Dimension of $\mathbf{k}_v$ / $\mathbf{b}_v$.
$\mathbf{c}_v$  The behavior-specific correction term.
$f$  The nonlinear transformation that projects the refined (corrected) KG embedding into the BG space.
$f(\mathbf{k}_v \odot \mathbf{c}_v)$  Projection of the KG embedding onto the behavior space by $f$.
$g$  The edge function that characterizes the interaction between entities in the behavior space.
$p_c$  The distribution of $\mathbf{c}_v$.
$p_b$  The distribution of $\mathbf{b}_v$.
$q_\phi$  The inference network.
$\mathbf{z}_v$ / $\mathbf{z}$  All the latent variables for $v$ / all entities.
4.2 The Generative Model
Section 2 sheds light on the bottom-up and top-down relations between KG and BG. KG is thought of as the abstract representation of an entity, and BG is its realization under a certain context. We can view BG as a mix of KG and a context-specific factor (an adjustment term), but usually it only reflects certain aspects of KG (i.e., a projection of the mix). Such insights motivate us to connect KG and BG in a generative model as follows.
Throughout this paper, we focus on the case where each entity has one KG embedding and one BG embedding. Mathematically, suppose there are $n$ entities and each entity $v$ has a KG embedding $\mathbf{k}_v$ and a BG embedding $\mathbf{b}_v$. As depicted in Figure 1, $\mathbf{k}_v$ and $\mathbf{b}_v$ act as priors and observations respectively. We use $\mathbf{c}_v$ to model the adjustment effects between $\mathbf{k}_v$ and $\mathbf{b}_v$. In other words, $\mathbf{c}_v$ acts as a residual to $\mathbf{k}_v$ so that $\mathbf{k}_v \odot \mathbf{c}_v$ is sufficient to determine the marginal distribution of $\mathbf{b}_v$ via a projection function $f$. The projection not only reflects the fact that BG characterizes KG partially, but is also technically required to map $\mathbf{k}_v \odot \mathbf{c}_v$ into the BG space. To be more specific, we assume the joint distribution of $\{\mathbf{b}_v\}_{v=1}^{n}$ hinges on the following three components:

1) "Refined" KG embeddings $\mathbf{k}_v \odot \mathbf{c}_v$, where the $\mathbf{c}_v$'s are sampled from the behavior-specific distribution $p_c$;

2) The nonlinear transformation $f$ that projects the refined KG embedding into the BG space;

3) The distribution $p_b$ of the BG embedding $\mathbf{b}_v$ given $f(\mathbf{k}_v \odot \mathbf{c}_v)$.
Then, we write the generative model as

$$\mathbf{c}_v \sim p_c(\cdot), \qquad \mathbf{b}_v \mid \mathbf{c}_v \sim p_b\big(\cdot \mid f(\mathbf{k}_v \odot \mathbf{c}_v)\big), \quad v = 1, \ldots, n. \tag{1}$$
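A minimal numpy sketch of this generative process follows; the toy sizes, the log-normal correction scale and the random two-layer MLP standing in for $f$ are illustrative assumptions, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_kg, d_bg, hidden = 5, 8, 4, 16      # toy sizes

K = rng.normal(size=(n, d_kg))           # pretrained KG embeddings (the priors)

# Behavior-specific correction terms, sampled from a shared log-normal.
C = np.exp(rng.normal(scale=0.1, size=(n, d_kg)))
K_refined = K * C                        # element-wise correction

# A random two-layer MLP standing in for the projection f into the BG space.
W1, W2 = rng.normal(size=(d_kg, hidden)), rng.normal(size=(hidden, d_bg))
def f(x):
    return np.tanh(x @ W1) @ W2

# BG embeddings (the observations) scatter around the projected embeddings.
B = f(K_refined) + rng.normal(scale=0.05, size=(n, d_bg))
print(B.shape)  # (5, 4)
```

Note how the correction stays multiplicative and positive, so a correction of all ones leaves the KG embedding untouched.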
Our target is to optimize the following objective function:

$$\max_{f,\, p_c,\, p_b} \; \sum_{v=1}^{n} \log p\big(\mathbf{b}_v \mid \mathbf{k}_v\big). \tag{2}$$
However, the objective function (2) under Model (1) is generally intractable. For the sake of computational feasibility, assumptions are needed to simplify the model:

1) To reduce the model complexity, we assume the $\mathbf{c}_v$'s are independently and identically distributed, i.e., $\mathbf{c}_v \overset{\text{i.i.d.}}{\sim} p_c$, where $p_c$ is shared by all the entities.

2) To retain the interaction information between entities, we come up with an edge function $g$ that characterizes the interplay between $\mathbf{b}_u$ and $\mathbf{b}_v$. For example, $g(\mathbf{b}_u, \mathbf{b}_v)$ can be the similarity or the vector difference between $\mathbf{b}_u$ and $\mathbf{b}_v$. Then, $p_b$ is assumed to be a generative distribution for $g(\mathbf{b}_u, \mathbf{b}_v)$.

3) To further reduce the model complexity, we assume the $g(\mathbf{b}_u, \mathbf{b}_v)$'s are i.i.d. sampled from $p_b$, where $p_b$ is shared by all pairs $(u, v)$.
Then, Model (1) is reduced to

$$\mathbf{c}_v \overset{\text{i.i.d.}}{\sim} p_c(\cdot), \qquad g(\mathbf{b}_u, \mathbf{b}_v) \overset{\text{i.i.d.}}{\sim} p_b\big(\cdot \mid f(\mathbf{k}_u \odot \mathbf{c}_u),\, f(\mathbf{k}_v \odot \mathbf{c}_v)\big), \tag{3}$$
which is visualized in Figure 2 (a). Compared to Model (1), the reduced model has much lower complexity while retaining the interaction information between entities, i.e., preserving the topological structure, which is crucial for all BG and KG embedding methods [9, 7]. We call this model BEM-P ("P" denotes pairwise interactions). In comparison, we can ignore the interactions for further complexity reduction:
$$\mathbf{c}_v \overset{\text{i.i.d.}}{\sim} p_c(\cdot), \qquad \mathbf{b}_v \sim p_b\big(\cdot \mid f(\mathbf{k}_v \odot \mathbf{c}_v)\big), \quad v = 1, \ldots, n. \tag{4}$$
In fact, Model (4) is a special case of Model (3) obtained by letting the edge function depend on a single node and dropping the pairwise terms, so that it becomes a model with full independence. We call this model BEM-I ("I" denotes vertex independence). Finally, for the sake of simplicity, we denote by BEM-O ("O" denotes NULL) the use of the original embeddings directly, without applying BEM. All these models are summarized in Table 2. In the sequel, we will omit the subscripts $u$ and $v$ for simplicity when this does not bring about ambiguity.
Abbreviation  Meaning
BEM-P  Model (3), with pairwise interactions.
BEM-I  Model (4), with vertex independence.
BEM-O  Without applying BEM.
⋆-P, ⋆-I (⋆ ∈ {BG, KG})  The ⋆ embeddings refined by BEM-P / BEM-I.
⋆-O (⋆ ∈ {BG, KG})  The original ⋆ embeddings, without refinement.
concat-X (X ∈ {P, I, O})  The concatenation of the KG-X and the BG-X embeddings.
4.3 The Inference Model
Given Equation (3), the objective function (2) can be rewritten as

$$\max_{f,\, p_c,\, p_b} \; \sum_{(u,v)} \log p\big(g(\mathbf{b}_u, \mathbf{b}_v) \mid \mathbf{k}_u, \mathbf{k}_v\big). \tag{5}$$
There are a variety of off-the-shelf methods to optimize Equation (5), such as the EM [29] or MCMC [12] algorithms. But these methods usually fail due to intractability or scalability issues. To this end, we resort to variational inference [4], which is very popular for large-scale scenarios or distributions with intractable integrals. Let $\mathbf{z}_v$ be the set of all the latent variables for node $v$, and $\mathbf{z} = \{\mathbf{z}_v\}_{v=1}^{n}$. For example, in the generative model (4), $\mathbf{z}_v = \{\mathbf{c}_v\}$. It is easy to derive that

$$\log p(\mathbf{b}) \;\ge\; \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{b})}\big[\log p(\mathbf{b} \mid \mathbf{z})\big] \;-\; \mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{b}) \,\|\, p(\mathbf{z})\big), \tag{6}$$

where $q_\phi(\mathbf{z} \mid \mathbf{b})$ is called the inference model [22], i.e., an approximated density function for the posterior density of $\mathbf{z}$ given $\mathbf{b}$, and $p(\mathbf{z})$ is the associated prior density. Formula (6) is also called the variational lower bound or evidence lower bound (ELBO) [17] for $\log p(\mathbf{b})$. The first term in the ELBO is the reconstruction term that measures the goodness of fit, while the second is a penalty term that measures the distance between the approximated density and the prior density. Then, our goal of maximizing $\log p(\mathbf{b})$ can be relaxed to maximizing the ELBO. It is well-known that the naive Monte-Carlo gradient estimator exhibits very high variance and is impractical when $n$ is large [33]. Thus we utilize particular distributions and introduce additional assumptions to further simplify the ELBO.

We assume $p_b$ to be a multivariate normal density with mean $f(\mathbf{k}_v \odot \mathbf{c}_v)$ and covariance matrix $s_v^2 I$, where $s_v^2$ is the sample-specific variance (see Figure 2 (a)). Here, $\mathbf{c}_v$ and $s_v^2$ are assumed to be sampled from multivariate log-normal distributions; the latent variable $s_v^2$ is introduced to account for the nuisance variation induced by sampling (see Section 4.4). We choose the multivariate normal/log-normal distributions because they enjoy appealing statistical and computational properties: 1) normal/log-normal random variables are easy to sample; 2) normal/log-normal distributions can be easily reparametrized with only two parameters [22]; 3) there is a closed-form expression for the KL divergence between two normal/log-normal distributions.

By introducing the latent variable $s_v^2$, the set of latent variables for node $v$ becomes $\mathbf{z}_v = \{\mathbf{c}_v, s_v^2\}$ and $\mathbf{z} = \{\mathbf{z}_v\}_{v=1}^{n}$. We then impose two common conditions from mean-field variational inference [22]:

1) Both $q_\phi(\mathbf{z} \mid \mathbf{b})$ and $p(\mathbf{z})$ are from the mean-field family, that is, they factorize over the nodes and the coordinates;

2) The factors for $\mathbf{c}_v$ and $s_v^2$ are normal and log-normal densities with a diagonal covariance matrix, respectively.
Thus, the approximated posterior means and variances of each element in $\mathbf{z}_v$ can be represented as a function of $\mathbf{k}_v$ and $\mathbf{b}_v$, denoted as $q_\phi(\mathbf{k}_v, \mathbf{b}_v)$, which is called the inference network. In detail,

$$\big(\mu_{c,v},\, \sigma^2_{c,v},\, \mu_{s,v},\, \sigma^2_{s,v}\big) = q_\phi(\mathbf{k}_v, \mathbf{b}_v), \tag{7}$$

where $\mu_{c,v}$, $\sigma^2_{c,v}$, $\mu_{s,v}$, $\sigma^2_{s,v}$ are the approximated posterior means and variances (each variance being a vector consisting of the diagonal elements of the covariance matrix) of $\log \mathbf{c}_v$ and $\log s_v^2$ respectively. With the reparametrization trick, we can express $\log \mathbf{c}_v$ and $\log s_v^2$ through auxiliary standard normal variables, as given in Equation (9). Correspondingly, we express their prior means and variances as $\mu^0_c$, $\lambda_c$, $\mu^0_s$, $\lambda_s$, where $\lambda_c$ and $\lambda_s$ are two tuning parameters. Then the ELBO in Equation (6) can be explicitly expressed. The reconstruction term is
$$\sum_{(u,v)} \mathbb{E}_{q_\phi}\Big[\log p_b\Big(g(\mathbf{b}_u, \mathbf{b}_v) \,\Big|\, f(\mathbf{k}_u \odot \mathbf{c}_u),\, f(\mathbf{k}_v \odot \mathbf{c}_v),\, s_u^2,\, s_v^2\Big)\Big] + C, \tag{8}$$

where $C$ is a constant and the latent variables are expressed via the reparametrization

$$\log \mathbf{c}_v = \mu_{c,v} + \sigma_{c,v} \odot \epsilon_v, \qquad \log s_v^2 = \mu_{s,v} + \sigma_{s,v}\, \epsilon'_v, \qquad \epsilon_v \sim N(0, I),\; \epsilon'_v \sim N(0, 1). \tag{9}$$
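The reparametrization in Equation (9) can be checked numerically: a normal sample is written as the mean plus scaled parameter-free noise, and exponentiating it yields the log-normal latent variables. All values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
mu = np.array([0.5, -1.0, 2.0])        # approximated posterior means
log_var = np.array([-2.0, 0.0, -1.0])  # log of the diagonal variances
sigma = np.exp(0.5 * log_var)

eps = rng.standard_normal(mu.shape)    # auxiliary noise, parameter-free
z = mu + sigma * eps                   # reparametrized normal sample
c = np.exp(z)                          # reparametrized log-normal sample

# Averaging many draws recovers mu, while each draw stays a deterministic
# (hence differentiable) function of mu and sigma given the noise.
many = mu + sigma * rng.standard_normal((100_000, 3))
print(np.abs(many.mean(axis=0) - mu).max() < 0.02)  # True
```

Because the randomness is isolated in the noise term, gradients with respect to the posterior parameters can flow through the sample, which is what keeps the Monte-Carlo gradient estimator low-variance.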
The penalty term is

$$-\sum_{v=1}^{n} \Big[ \mathrm{KL}\big(N(\mu_{c,v}, \mathrm{diag}(\sigma^2_{c,v})) \,\big\|\, N(\mu^0_c, \lambda_c I)\big) + \mathrm{KL}\big(N(\mu_{s,v}, \sigma^2_{s,v}) \,\big\|\, N(\mu^0_s, \lambda_s)\big) \Big] + C', \tag{10}$$

where $C'$ is a constant; the KL divergence between two log-normal distributions equals that between the corresponding normal distributions in log-space.
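The tractability of the penalty term comes from the closed-form KL divergence between two diagonal Gaussians; a sketch follows, where the function name and the example values are ours.

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), closed form."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

mu = np.array([0.3, -0.7])
var = np.array([0.5, 1.2])

# The KL divergence vanishes when posterior and prior coincide...
print(kl_diag_gauss(mu, var, mu, var))  # 0.0
# ...and grows as the posterior mean drifts from the prior mean,
# which is exactly the drift the penalty term discourages.
print(kl_diag_gauss(mu + 1.0, var, mu, var) > 0)  # True
```

No sampling is needed for this term, so only the reconstruction term requires Monte-Carlo estimation.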
We can draw several implications from the closed-form expression of the ELBO. Maximizing the ELBO in Equation (6) is equivalent to minimizing the sum of the reconstruction term (8) and the penalty term, which are balanced by the two tuning parameters. Minimizing the reconstruction term forces the corrected KG/BG embeddings to behave similarly to the observed BG embeddings as per the selected edge function, suggesting that the reconstruction term preserves the topological structure of BG. Accordingly, minimizing the penalty term enforces the approximated posterior means/variances to be close to the prior means/variances. If the prior mean of the correction term is set to be the identity (i.e., zero in log-space), such minimization forces the corrected KG/BG embeddings to be close to the observed KG embeddings, indicating that the penalty term preserves the topological structure of KG. Thus, the refined KG/BG embeddings can be regarded as a mixture of information from the two sources, and the two tuning parameters act as controllers of such mixing. For example, a small prior variance makes the corrected embeddings lean towards the observed KG embeddings rather than the observed BG embeddings, and vice versa.
4.4 Algorithm
Given all the components discussed above, we can write down the detailed algorithm of BEM. First, we sample two batches of entities of batch size $B$, denoted as batches $\mathcal{B}$ and $\mathcal{B}'$, and pair them up randomly, denoted as $\{(u, v) : u \in \mathcal{B}, v \in \mathcal{B}'\}$. For each batch, we impose the same prior information on all the samples in the batch, and estimate the prior mean and variance by the bootstrap:

$$\mu^0 = \frac{1}{R} \sum_{r=1}^{R} \hat{\mu}^{(r)}, \qquad \lambda = \frac{1}{R-1} \sum_{r=1}^{R} \big(\hat{\mu}^{(r)} - \mu^0\big)^2, \tag{11}$$

where $R$ is the number of bootstrap replicates and $\hat{\mu}^{(r)}$ is the $r$th bootstrap estimator computed from a resample of the batch. Then, for each pair of samples $(u, v)$, we use the inference network (7) to get the approximated posterior means and variances, as shown in Figure 2 (b). Next, we sample standard normal variables to get $\mathbf{c}_u, \mathbf{c}_v$ and $s_u^2, s_v^2$ by Equation (9). We obtain the ELBO in Equation (6) via the reconstruction term (8) and the penalty term, as shown in Figure 2 (c). Finally, we can use any optimization method, such as Adam [21], to update $\phi$ and $f$ when maximizing the ELBO. We run the above steps for $T$ iterations, and obtain the refined KG/BG embeddings for entity $v$ by

$$\hat{\mathbf{k}}_v = \mathbf{k}_v \odot \exp(\mu_{c,v}), \qquad \hat{\mathbf{b}}_v = f(\hat{\mathbf{k}}_v). \tag{12}$$
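The batch-level prior estimation of Equation (11) amounts to a plain bootstrap over the KG embeddings of one batch; a sketch with illustrative sizes and number of replicates:

```python
import numpy as np

rng = np.random.default_rng(1)
batch = rng.normal(size=(128, 16))  # KG embeddings of one sampled batch
R = 50                              # number of bootstrap replicates

# Each replicate resamples the batch with replacement and records its mean.
boot_means = np.stack([
    batch[rng.integers(0, len(batch), size=len(batch))].mean(axis=0)
    for _ in range(R)
])

prior_mean = boot_means.mean(axis=0)        # estimated prior mean
prior_var = boot_means.var(axis=0, ddof=1)  # estimated prior variance
print(prior_mean.shape, prior_var.shape)
```

Sharing one bootstrap-estimated prior across a batch is what makes the prior cheap to maintain while the batches themselves change every iteration.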
To analyze the complexity of Algorithm 1, we simply use two-layer MLPs (multi-layer perceptrons) for $f$ and $q_\phi$. Let $H$ be the number of hidden nodes of these neural networks. The cost of each iteration then scales with the batch size, the embedding dimensions and $H$, and the overall computational cost grows linearly in the number of iterations $T$ of the maximization step in Algorithm 1. Furthermore, the storage complexity is modest, since the algorithm merely needs to keep track of the two sets of parameters in $f$ and $q_\phi$. Therefore, the algorithm is efficient in both time and storage in the sense that the size of the dataset only affects the computational time linearly. However, when the dataset is too large to be entirely loaded into memory, the algorithm might suffer from a non-negligible overhead caused by partitioning and loading the data during the iterations.

4.5 Edge Function
The edge function $g$ in Equation (3) characterizes the interplay between nodes. The choice of this function determines what kind of KG information is incorporated into the BG embeddings. We give examples below:

1) An arbitrary similarity function can be used for $g$, measuring the similarity between $\mathbf{b}_u$ and $\mathbf{b}_v$. Such a choice coincides with the objective functions of the majority of BG/KG embedding methods [9, 7]. For instance, GraphSAGE [15], GCN [23], node2vec [13], etc., maximize the inner product between positive samples while minimizing this metric between negative samples.

2) If the edge function only relies on the indices $u$ and $v$, such as the edge attribute between node $u$ and node $v$, BEM becomes a supervised model.
In this article, we use the translation function $g(\mathbf{b}_u, \mathbf{b}_v) = \mathbf{b}_u - \mathbf{b}_v$. In fact, the translation function is equivalent to the similarity function using the inner product or cosine similarity if the embeddings are normalized onto the unit sphere, as are the embeddings generated by GraphSAGE, TransE and its variants. As shown in Figure 2 (d), the modulus of the difference between two points on the sphere is bijectively mapped to the angle between the rays from the origin to the two points.
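This equivalence follows from the identity $\|u - v\|^2 = 2 - 2\cos\theta$ for unit vectors, so the translation modulus and the cosine similarity determine each other bijectively; a quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(7)
u, v = rng.normal(size=(2, 64))
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)  # project both points onto the unit sphere

translation = np.linalg.norm(u - v)  # modulus of g(u, v) = u - v
cosine = float(u @ v)                # cosine similarity of the pair

# ||u - v||^2 = 2 - 2 cos(theta): the two quantities carry the same information.
print(np.isclose(translation ** 2, 2 - 2 * cosine))  # True
```

Hence choosing the translation edge function does not discard any of the angular information that similarity-based objectives rely on.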
5 Experiments
We empirically study and evaluate BEM on two small datasets and one large-scale dataset for a variety of tasks. Each dataset consists of one KG and one BG with pretrained node embeddings. The goal of these experiments is to show that embeddings refined by BEM can outperform the original pretrained embeddings on some tasks, while retaining their efficacy on most of the others:

The node classification task (on the two small datasets) studies whether BEM can help refine the KG/BG embedding using the BG/KG embedding. It also investigates whether BEM can reveal useful information in KG and BG for the classification purpose (Section 5.1.2).

The item recommendation task (on the large dataset) studies whether the information in KG can enhance the performance of the BG embedding.
For the node classification task, we study the KG/BG embeddings and the concatenated embeddings refined by BEM. In contrast, we only consider the KG embedding for the link prediction and triplet classification tasks, since these two tasks are designed for the KG embedding. We only consider the BG embedding for the item recommendation task for the same reason.
We implement BEM as per Algorithm 1 based on TensorFlow (https://www.tensorflow.org/); the code can be found at https://github.com/Elric2718/Bayes_Embedding. Throughout this section, we use the following default parameter setting:

The batch size, the learning rate, the number of training steps and the two tuning parameters are set to default values; the optimization algorithm is Adam [21]. A discussion on the selection of the above parameters is deferred to Appendix C.
5.1 Two Small Datasets
The two small datasets share the same KG but differ in the BGs. The shared KG is FB15K-237, which is reduced from FB15K by removing the reversal relations [10]. It contains 14,541 entities, 237 relations, and 272,115 training triplets, 17,535 validation triplets and 20,466 testing triplets. The first dataset uses a page-link network (denoted as pagelink) that records the linkages between the Wikipedia pages of the entities in FB15K-237; its nodes form a subset of the entities in FB15K-237. The second dataset comes with a short paragraph description (denoted as desc) for each entity in FB15K-237. Strictly speaking, the descriptions do not form a BG due to the lack of connections between descriptions. We regard them as an isolated graph to evaluate BEM under the extreme condition where BG does not contain any interplay information between nodes. See Appendix A for more details on the two datasets.
FB15K-237 + pagelink  
node2vec  LINE  
BEM  KG  BG  concat  KG  BG  concat  
TransE  O  85.59  75.12  89.39  85.59  77.57  89.44 
I  85.51  82.56  85.97  86.35  85.44  87.05  
P  88.89  86.32  90.29  88.21  86.27  90.01  
TransD  O  86.06  75.12  89.18  86.06  77.57  89.00 
I  83.73  78.86  84.16  86.58  85.10  86.69  
P  88.60  85.39  89.90  88.70  85.30  89.73  
FB15K-237 + desc  
doc2vec  sentence2vec  
BEM  KG  BG  concat  KG  BG  concat  
TransE  O  85.32  75.62  87.92  85.32  83.42  88.43 
I  86.19  81.50  86.41  87.61  85.18  88.07  
P  87.68  81.52  87.86  88.05  85.82  88.57  
TransD  O  85.83  75.62  88.07  85.83  83.42  88.52 
I  86.75  81.44  86.85  87.96  84.97  88.07  
P  87.34  82.24  88.15  88.36  86.12  88.86 
We use TransE [5] and TransD [18] from OpenKE (https://github.com/thunlp/OpenKE) [16] to pretrain the KG embeddings; both are trained with the remaining parameters taken as default. For the BGs, we use doc2vec [25] and sentence2vec [32] to pretrain the desc BG embeddings, and node2vec [13] and LINE [39] to pretrain the pagelink BG embeddings, respectively. More details on the experiment and hyperparameter setups, including the training epochs and the embedding dimensions, are included in Appendix B.
5.1.1 Node classification
In the node classification task, the embeddings are fed into a multi-label logistic regression model for training and prediction. Table 3 shows the results of BEM, from which we can draw three implications. First, we observe consistent improvements of BEM-P over BEM-O (the original embeddings) through almost all settings, with boosted accuracies for both KG and BG. It indicates that we can benefit from integrating the information of the two sources. Second, if the classifier is sufficiently expressive, concat-O is expected to perform the best since there is no loss of information from the input. However, concat-P turns out to perform slightly better than concat-O in most cases. It suggests that BEM-P not only preserves the information needed for node classification, but also reveals new signals. Third, as expected, BEM-P outperforms BEM-I, since the former accounts for the pairwise interactions that are crucial for the embedding learning of KG/BG. Finally, we point out that the concatenated embedding and the KG/BG embedding are not directly comparable: the concatenated embedding is longer than the BEM-refined embedding, so the classifier for the former has more parameters and is thus more expressive. For a fair comparison, we study the projection of the concatenated embedding onto the BG/KG space; the associated results are deferred to Appendix D.

5.1.2 Empirical analysis
To understand the properties of the embeddings refined by BEM-P, we perform two empirical analyses on the FB15K-237 + pagelink dataset. First, we compute the absolute cosine similarity for each pair of nodes using KG-O, KG-P, BG-O and BG-P respectively. From Figure 3, we observe that KG-P and BG-P are distributed more extremely than KG-O and BG-O: there are more highly correlated and more uncorrelated node pairs for the former. It indicates that BEM-P enforces some nodes to group tightly while some others are pushed apart. This result can also be concluded from the visualization of the embeddings using t-SNE (Figure 4). Second, we use the class labels from the node classification task to compute an alignment metric that contrasts the average pairwise similarity of the embeddings within a class against that across two classes. This metric reflects the degree to which the topological structure of the embeddings aligns with the labels. The computed values indicate that BEM-P enforces nodes in the same class to get closer to each other while nodes across classes are pulled away. This result suggests that BEM-P is able to preserve and further reveal the topological structure for both KG and BG.
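One concrete instantiation of such an alignment metric, with the formula and names being our illustrative choices rather than the paper's exact definition: the ratio of the average absolute cosine similarity within classes to that across classes.

```python
import numpy as np

def class_alignment(emb, labels):
    """Within-class vs. cross-class average |cosine similarity| ratio.

    A value above 1 means same-class nodes sit geometrically closer than
    cross-class nodes (an illustrative formula, not the paper's exact one).
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = np.abs(emb @ emb.T)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)                    # drop self-similarity
    cross = labels[:, None] != labels[None, :]
    return sim[same].mean() / sim[cross].mean()

# Two classes concentrated around orthogonal directions align strongly.
rng = np.random.default_rng(3)
center0, center1 = np.eye(8)[0] * 3, np.eye(8)[1] * 3
emb = np.concatenate([
    center0 + rng.normal(scale=0.1, size=(20, 8)),
    center1 + rng.normal(scale=0.1, size=(20, 8)),
])
labels = np.array([0] * 20 + [1] * 20)
print(class_alignment(emb, labels) > 1.0)  # True
```

Under a metric of this form, embeddings whose geometry tracks the labels yield larger values, matching the comparison reported between BEM-P and BEM-O.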
5.1.3 Link Prediction and Triplet Classification on the KG side
We evaluate BEM on the link prediction and triplet classification tasks. Since BEM can only refine the entity embeddings, we retrain the relation embeddings for additional epochs, using the BEM-refined KG embeddings and the original relation embeddings as the initial values. In Table 4, notice that the KG embeddings can also benefit from incorporating the BG information via the BEM refining. In contrast, the concat-O embeddings are markedly inferior. It validates that concatenation does not fully expose the topological structure of KG, while BEM can make good use of this information. Moreover, we observe that the improvement mainly occurs for the pagelink dataset. For the desc dataset, the TransD embeddings improve slightly while the TransE embeddings get worse after applying the BEM refining. This observation can be explained by the fact that the desc dataset does not provide supplementary interaction information for the KG.
Metrics  Embedding  FB15K-237 + pagelink  FB15K-237 + desc  
TransE  TransD  TransE  TransD  
node2vec  LINE  node2vec  LINE  doc2vec  sentence2vec  doc2vec  sentence2vec  

KG-O  43.14  43.14  43.86  43.86  43.14  43.14  43.86  43.86  
KG-I  42.25  43.00  44.31  44.56  41.86  42.05  42.31  44.58  
KG-P  43.66  43.52  44.72  44.67  41.99  42.21  44.26  44.47  
concat-O  36.99  37.47  38.32  38.45  40.17  40.07  40.79  37.83  

KG-O  76.56  76.56  78.29  78.29  76.56  76.56  78.29  78.29  
KG-I  76.70  76.86  78.54  78.80  76.06  76.42  78.63  78.61  
KG-P  77.13  77.09  78.96  79.11  76.17  76.21  78.70  78.60  
concat-O  71.97  73.23  71.82  70.75  71.41  71.32  72.15  69.91
5.2 A Large-Scale Dataset
#ent.  #scenario  #category  #rel.  #train 
17.37M  182K  8.96K  5.18K  60.65M 
#item  #value  #user  #edge_click  #edge_purchase 
9.14M  8.04M  482M  7,952M  144M 
In this section, we apply BEM to the KG/BG embeddings generated from an Alibaba Taobao large-scale dataset (details of the dataset are deferred to Appendix A), whose statistics are summarized in Table 5. For computational efficiency, TransE is used to obtain the KG embeddings on a knowledge database established by Alibaba Taobao. For the BG embeddings, we run GraphSAGE on a graph constructed from users' behaviors, e.g., two items are connected if a certain number of customers bought them together over the past months. GraphSAGE is a representative graph neural network (GNN) and has achieved good performance on large datasets. The dimensions of the KG and BG embeddings follow the online setting of Alibaba Taobao. We take the recommendation task for evaluation. Specifically, each customer has a set of trigger items drawn from his/her historical behaviors, including clicks, purchases, add-to-preferences and add-to-carts. These trigger items are then used to retrieve (via FAISS [19]) more items based on the BG embeddings. We evaluate our method by counting how many retrieved items are actually bought/clicked by the user in the following days. Table 6 exhibits the hit recall rates of BG-P and BG-O on the recommendation task.
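As a toy illustration of this retrieval step, the sketch below performs brute-force inner-product nearest-neighbour search over item embeddings. It is a small stand-in for what FAISS does efficiently at billion scale; the array sizes, trigger ids, and function name are illustrative, not from the paper.

```python
import numpy as np

def retrieve_top_k(item_emb, trigger_ids, k=3):
    """Brute-force nearest-neighbour retrieval by inner product.
    A toy stand-in for a FAISS index query over BG item embeddings."""
    scores = item_emb[trigger_ids] @ item_emb.T          # (T, N) similarities
    # exclude each trigger item from its own candidate list
    scores[np.arange(len(trigger_ids)), trigger_ids] = -np.inf
    # indices of the k highest-scoring items per trigger
    return np.argsort(-scores, axis=1)[:, :k]

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16)).astype(np.float32)      # 100 toy items
candidates = retrieve_top_k(emb, trigger_ids=[3, 17], k=5)
```

At production scale, the exhaustive `@`-product would be replaced by an approximate index (e.g. FAISS), but the retrieval semantics are the same.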
We further check whether the retrieved items are of the same brand/category as the items actually bought/clicked in the following days. Combining these two granularities, we observe that the hit recall rates for BG-P are consistently boosted over BG-O, which is quite significant considering that there are millions of items. This validates that BEM-P is able to incorporate useful KG information into the BG embeddings for item recommendation.
Finally, for each concept/scenario, we use TransE to predict its top item categories based on KG-O and KG-P (see Section 5.1.3 for the detailed procedure). As shown in Table 7, KG-P finds more related item categories for the given concepts. It indicates that by incorporating the BG information via BEM, we can acquire novel knowledge that does not exist in the original KG.
Table 6: Hit recall rates (%) on the recommendation task.

Granularity   Hit@    click (BEM-O / BEM-P)   buy (BEM-O / BEM-P)
brand         10      15.97 / 16.14           24.87 / 25.10
              30      16.65 / 17.12           25.70 / 26.57
              50      17.26 / 17.90           26.39 / 27.33
category      10      27.46 / 27.40           27.85 / 27.91
              30      28.43 / 29.99           28.50 / 29.45
              50      29.58 / 32.88           29.26 / 31.47
Table 7: Top item categories predicted for each concept.

concept            predicted categories using KG-O    predicted categories using KG-P
neuter clothing    jacket, homewear
sports training    None
household items
6 Discussion
In this paper, we introduce BEM, a Bayesian framework that refines graph embeddings by integrating information from the KG and BG sources. BEM is designed by bridging KG and BG via a Bayesian generative model, in which the former is regarded as the prior and the latter as the observation. BEM has been evaluated in a variety of experiments and is shown to improve the embeddings on multiple tasks by leveraging information from the other side. It achieves superior or comparable performance to the concatenation baseline with higher efficiency on the node classification task, and helps in other tasks where simple aggregation methods (e.g., concatenation) are not applicable.
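The prior/observation intuition can be illustrated with a conjugate-Gaussian toy model. This is a deliberately simplified sketch, not the paper's actual variational objective: the KG embedding plays the prior mean, the BG embedding a noisy observation, and the refined embedding is the precision-weighted posterior mean.

```python
import numpy as np

def refine(kg_emb, bg_emb, prior_var=1.0, obs_var=0.5):
    """Conjugate-Gaussian illustration of 'KG as prior, BG as observation'.
    Posterior mean = precision-weighted blend of the two embeddings."""
    prior_prec, obs_prec = 1.0 / prior_var, 1.0 / obs_var
    post_var = 1.0 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * kg_emb + obs_prec * bg_emb)
    return post_mean, post_var

kg = np.array([0.0, 1.0])   # prior mean from the KG side
bg = np.array([1.0, 1.0])   # noisy observation from the BG side
mean, var = refine(kg, bg)  # refined embedding sits between the two
```

The more reliable side (smaller variance) dominates the blend, which mirrors how BEM lets the higher-quality source correct the other.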
Currently, only one BG is considered at a time. In fact, BEM can be easily extended to handle multiple BGs; integrating more than one BG may further refine the KG, as their behavior-specific biases can cancel each other out. Besides, for the time being, BEM works only on pretrained KG/BG embeddings. It can potentially be extended so that the networks producing the KG/BG embeddings are connected and jointly trained through this framework. In other words, BEM can act as an interface that connects any KG embedding method with any BG embedding method for end-to-end training. This makes the learning of the BG embedding supervised by the KG information; in turn, the learning of the KG embedding can be supplemented with instantiated samples from the BG.
Appendix
Appendix A Dataset Details
The public data used in our experiments mainly includes FB15K-237, the page-link network, the descriptions of entities and the labels of entities. Their sources are discussed below.
A.1 Small datasets
The two small datasets share the same KG but differ in their BGs. Their relations are depicted in Figure 5.
Knowledge Graph We use FB15k-237, a subset of Freebase, as the knowledge graph; it is also used in ConvE [10]. Unlike the popular FB15k dataset used in much knowledge graph representation research, it excludes the inverse relations that may cause leakage from the training set to the validation set. FB15k-237 has 14,541 entities, 237 relations, 272,115 training triples, 20,466 test triples and 17,535 validation triples.
Page-link Network The page-link network is a directed graph that we generated ourselves. Since FB15k is a subset of Freebase, we first map the entities of FB15k to Wikidata, a knowledge base that supports Wikipedia and Wikimedia Commons, according to the mapping data in the Freebase dump [1]. Then we use the page links in the English Wikipedia to build the page-link network. Since we could not obtain all the data, the page-link network contains fewer entities than the knowledge graph: 14,071 vertices and 1,065,412 edges in total.
Descriptions of Entities The descriptions used in our experiments are the same as those in DKRL [44]: 14,904 English descriptions of entities.
Labels of Entities In Wikidata, the property 'instance of' is an is-a relation that represents the class an entity belongs to. We therefore use the values of 'instance of' as the entity labels in the node classification task. We also consider the risk of information leakage: in Freebase, the relation 'type/object/type' represents the type of an entity, so we check that this relation is not used in the training triples and thus cannot leak label information into the evaluation tasks.
A.2 Large dataset
Knowledge Graph of Alibaba Taobao The knowledge graph of Alibaba Taobao items has a tree structure. It contains four types of entities: items, the categories the items belong to, the scenes of the categories, and the attribute values of the items. Accordingly, there are three types of triples relating these entities; among them, the first is an N-N mapping, the second a 1-N mapping and the third an N-N mapping.
Behavior Graph of Alibaba Taobao The behavior graph of Alibaba Taobao is a bipartite graph containing both user and item nodes. Interactions between users and items are CLICK or BUY, sampled from a sliding window of two weeks (Dec. 27, 2018 to Jan. 10, 2019). The data from the first week was used for training. We then used the trained model to recommend items for users, with trigger items collected on Jan. 5, 2019, and checked whether the recommended items were actually clicked/bought in the following week.
Each user has specific features describing their properties, e.g., age, gender, occupation, preference for certain categories of items, and recently clicked items; each item has features such as price, category, brand, etc. Edges (interactions) carry weights that decay with time. When learning the node embeddings of the behavior graph, we use the edges between users and items as positive samples and randomly corrupted edges as negative samples. Node features are incorporated along with the edges in the training phase.
Appendix B Further details on the embedding methods
We use several embedding methods to obtain the embeddings for the different datasets. Their details are given below.
TransE TransE is a typical knowledge graph representation method [5]. It treats a relation as a translation from the head entity to the tail entity, i.e., h + r ≈ t, with the scoring function

f(h, r, t) = ||h + r - t||.   (13)

In this work, we use the TransE API offered by OpenKE [16] to obtain the entity embeddings of the knowledge graph.
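The TransE score above can be sketched in a few lines of NumPy. The toy vectors below are illustrative only; they are chosen so that the "true" triple satisfies h + r = t exactly.

```python
import numpy as np

def transe_score(h, r, t, norm=2):
    """TransE energy ||h + r - t||: low (ideally 0) for true triples."""
    return np.linalg.norm(h + r - t, ord=norm)

h = np.array([1.0, 2.0])    # head entity embedding (toy)
r = np.array([3.0, -1.0])   # relation embedding (toy)
t = np.array([4.0, 1.0])    # tail entity embedding; here t == h + r
true_energy = transe_score(h, r, t)                       # 0 for the true triple
bad_energy = transe_score(h, r, np.array([1.0, 1.0]))     # larger for a corrupted tail
```

Training minimizes this energy for observed triples while pushing it up for corrupted ones (margin-based ranking loss).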
node2vec Node2vec is a network representation framework [13]. It uses a biased random walk procedure to preserve the neighborhood information of the network in the node representations. Since the neighborhood information in the page-link network helps characterize an entity, we use node2vec to generate the vertex embeddings of the page-link network. In our experiments, we set the parameters as follows: the walk length is 80, the number of walks is 10, and the context size is 10.
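A single node2vec-style biased walk can be sketched as follows. This is a simplified version of the second-order sampling (no alias tables); the graph, seed, and parameter defaults are illustrative.

```python
import numpy as np

def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """One node2vec-style walk: the return parameter p and in-out
    parameter q bias the choice of the next node (simplified sketch)."""
    rng = rng or np.random.default_rng(0)
    walk, prev = [start], None
    for _ in range(length - 1):
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break
        weights = []
        for x in nbrs:
            if x == prev:                               # step back to the previous node
                weights.append(1.0 / p)
            elif prev is not None and x in adj[prev]:   # stays close to prev (distance 1)
                weights.append(1.0)
            else:                                       # moves outward (distance 2)
                weights.append(1.0 / q)
        probs = np.array(weights) / sum(weights)
        prev = cur
        walk.append(int(rng.choice(nbrs, p=probs)))
    return walk

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}      # toy undirected graph
walk = biased_walk(adj, start=0, length=80)             # walk length 80, as in the setup
```

The resulting walks are fed to a skip-gram model (context size 10 in the setup above) to produce the vertex embeddings.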
LINE LINE is a network representation method [39]. It preserves the first-order and second-order proximities of a network. In this work, we use the LINE API offered by OpenKE to obtain the entity embeddings of the page-link network. In our experiments, we set the negative ratio to 5 and use both the 1st-order and the 2nd-order proximities of the graph.
doc2vec Doc2vec is an unsupervised framework for learning embeddings of sentences or paragraphs [25]. Document embeddings are trained to predict words given their contexts within the document. We use it to obtain entity embeddings from the entity descriptions. In our experiments, we use PV-DM (the Distributed Memory Model of paragraph vectors).
sentence2vec Sentence2vec is an unsupervised, CBOW-inspired framework for learning embeddings of sentences or paragraphs [32]. It has achieved state-of-the-art performance on sentence similarity comparison tasks, so we use it to generate entity embeddings from the descriptions for the purpose of reconstructing the graph based on vertex similarity. In our experiments, we set the parameters as follows: the learning rate is 0.2, the learning-rate update rate is 100, the number of epochs is 5, the minimal number of word occurrences is 5, the minimal number of label occurrences is 0, and the maximal word n-gram length is 2.
GraphSAGE GraphSAGE is an inductive representation learning framework [15]. Unlike transductive graph embedding frameworks that only generate embeddings for seen nodes, GraphSAGE leverages node attribute information to learn an embedding function that generalizes, and is thus capable of generating representations for unseen nodes. We use GraphSAGE to learn node embeddings on Alibaba Taobao's behavior graph.
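A single GraphSAGE mean-aggregator layer can be sketched as follows. This is an illustrative NumPy-only forward pass under simplifying assumptions (full neighbourhoods instead of sampling, random untrained weights); the real framework trains the weight matrices end to end.

```python
import numpy as np

def sage_mean_layer(feats, adj, W_self, W_neigh):
    """One GraphSAGE mean-aggregator layer (simplified sketch):
    each node combines its own features with the mean of its neighbours',
    applies a ReLU, and is L2-normalised as in the original paper."""
    out = np.zeros((feats.shape[0], W_self.shape[1]))
    for v, nbrs in adj.items():
        neigh_mean = feats[nbrs].mean(axis=0) if nbrs else np.zeros(feats.shape[1])
        out[v] = np.maximum(feats[v] @ W_self + neigh_mean @ W_neigh, 0)
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))                       # 4 nodes, 8 input features
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}          # toy neighbourhoods
emb = sage_mean_layer(feats, adj, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```

Stacking two such layers gives each node a 2-hop receptive field, which is the usual GraphSAGE configuration.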
Appendix C The selection of parameters for Algorithm 1
To understand how the tuning parameters influence the performance of BEM-P, we apply Algorithm 1 to the pretrained FB15K-237 (KG) embeddings obtained by TransE and the pretrained page-link embeddings obtained by node2vec. Each time we change only one parameter from the default setup mentioned in Section 5. The associated link prediction and triplet classification results are displayed in Table 8. We can draw a few conclusions from these results:

- Most parameters, including the number of training steps, affect BEM-P only marginally. This indicates that BEM-P does not require high model complexity for expressiveness and converges quickly.

- As with other gradient-based algorithms, the learning rate is worth tuning.

- The most important parameters are the two balancing coefficients. As explained in the last paragraph of Section 4.3, they balance the reconstruction term against the penalty term in Equation (8) and the pairwise penalty equation. Tuning them on a validation set may yield a significant boost in performance; if the user wants to skip tuning, the default values are a good starting point.

For the node classification task, we obtain similar results on the same dataset.
Table 8: Parameter study. Each cell shows Hits@10 (%) in filtered link prediction / accuracy (%) in triplet classification for one parameter value.

parameter       values: Hits@10 / accuracy
                200: 43.38 / 77.25     500: 43.66 / 77.13    800: 43.41 / 77.19
                100: 43.77 / 77.40     500: 43.66 / 77.13    1000: 42.63 / 77.28
learning rate   0.0001: 42.53 / 76.59  0.001: 43.66 / 77.13  0.005: 44.95 / 77.89  0.01: 45.10 / 77.74
                10: 43.77 / 76.55      20: 43.66 / 77.13     50: 44.03 / 77.39     100: 43.83 / 76.97
                0.01: 44.91 / 78.21    0.1: 45.82 / 79.33    1: 43.66 / 77.13      5: 31.90 / 71.08
                0.01: 41.59 / 76.11    0.1: 41.74 / 76.61    1: 43.66 / 77.13      5: 44.35 / 77.41
Appendix D More results of the node classification task on the FB15K-237 dataset with two associative BGs
It is unfair to compare the BEM-refined embedding to the concatenated embedding directly, since the latter is longer than the former; accordingly, the classifier for the concatenation has more parameters than that for the KG (BG) embedding. To obtain a fair comparison, we project the concatenated embedding down to the KG (BG) dimension using a random Gaussian projection matrix, which nearly preserves the distances between nodes.
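The projection step can be sketched as follows (the dimensions are illustrative). With the 1/sqrt(d_out) scaling, pairwise distances are approximately preserved, as guaranteed by the Johnson-Lindenstrauss lemma.

```python
import numpy as np

def gaussian_project(X, d_out, rng=None):
    """Project rows of X to d_out dimensions with a random Gaussian matrix.
    The 1/sqrt(d_out) scaling keeps pairwise distances roughly unchanged."""
    rng = rng or np.random.default_rng(0)
    P = rng.normal(size=(X.shape[1], d_out)) / np.sqrt(d_out)
    return X @ P

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 400))          # e.g. concatenated embeddings (toy sizes)
Z = gaussian_project(X, d_out=200, rng=rng)  # projected to the smaller dimension
```

Repeating the projection with fresh random matrices, as done for Tables 9 and 10, gives a standard error on the downstream accuracy.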
Tables 9 and 10 illustrate the node classification results for BEM and its variations. Four implications can be drawn by reading the tables in different ways. First, we observe consistent improvements of BEM-P over BEM-O across all settings: the classification accuracies on both the BG and KG embeddings are boosted with BEM-P. As for the concatenation, the concat-O vector is expected to work better than the BEM embeddings if the classifier is expressive enough, since information might be lost during the BEM integration. However, it turns out that concat-P outperforms concat-O, indicating that BEM-P does not lose information relevant to the classification task and is able to shape the embeddings better for it. Second, for a fair comparison in terms of dimension, we use Gaussian random projections (repeated multiple times) to project the concatenated embedding down to the KG and BG dimensions, respectively. KG-P is superior to the projections of concat-O at both dimensions, and is even comparable to the full concat-O. From the perspective of dimension reduction, this suggests that BEM-P preserves the majority of the information in the KG. On the other hand, given its goal of preserving topological structure, BEM-P is unlikely to boost a low-quality BG-O to the level of concat-O. Third, we note that the projections of concat-P lose only marginal power during dimension reduction and are more robust than the projections of concat-O, indicating that the BEM-P representation is less noisy than the original embeddings. Finally, as expected, BEM-P outperforms BEM-I since the former accounts for the pairwise interactions; this key information is crucial for learning both the KG and BG embeddings.
Table 9: Node classification accuracy (%) on FB15K-237 with the page-link BGs. "concat→KG" and "concat→BG" denote the concatenated embedding randomly projected down to the KG and BG dimensions; numbers in parentheses are the standard errors across the random projections.

BG = node2vec
KG model   BEM   KG       BG       concat→KG       concat→BG       concat
TransE     O     85.59    75.12    82.86 (0.93)    87.58 (0.27)    89.39
TransE     I     85.51    82.56    83.96 (0.35)    85.39 (0.16)    85.97
TransE     P     88.89    86.32    88.71 (0.20)    89.27 (0.16)    90.29
TransD     O     86.06    75.12    82.37 (0.82)    86.24 (0.26)    89.18
TransD     I     83.73    78.86    81.67 (0.80)    83.83 (0.28)    84.16
TransD     P     88.60    85.39    87.83 (0.26)    88.92 (0.27)    89.90

BG = LINE
TransE     O     85.59    77.57    79.49 (1.14)    86.54 (0.40)    89.44
TransE     I     86.35    85.44    85.65 (0.34)    86.63 (0.14)    87.05
TransE     P     88.21    86.27    88.21 (0.48)    89.12 (0.15)    90.01
TransD     O     86.06    77.57    77.51 (1.16)    85.80 (1.03)    89.00
TransD     I     86.58    85.10    85.36 (0.35)    86.40 (0.18)    86.69
TransD     P     88.70    85.30    87.95 (0.44)    88.82 (0.15)    89.73

Table 10: Node classification accuracy (%) on FB15K-237 with the desc BGs (same layout as Table 9).

BG = doc2vec
KG model   BEM   KG       BG       concat→KG       concat→BG       concat
TransE     O     85.32    75.62    83.87 (0.63)    86.77 (0.23)    87.92
TransE     I     86.19    81.50    85.32 (0.48)    86.10 (0.08)    86.41
TransE     P     87.68    81.52    86.40 (0.21)    87.78 (0.21)    87.86
TransD     O     85.83    75.62    83.19 (0.62)    86.57 (0.35)    88.07
TransD     I     86.75    81.44    85.48 (0.73)    86.52 (0.13)    86.85
TransD     P     87.34    82.24    86.31 (0.43)    87.57 (0.19)    88.15

BG = sentence2vec
TransE     O     85.32    83.42    86.67 (0.39)    87.58 (0.24)    88.43
TransE     I     87.61    85.18    86.95 (0.31)    87.70 (0.14)    88.07
TransE     P     88.05    85.82    87.61 (0.16)    88.36 (0.24)    88.57
TransD     O     85.83    83.42    85.69 (0.32)    87.83 (0.16)    88.52
TransD     I     87.96    84.97    86.89 (0.40)    87.91 (0.24)    88.07
TransD     P     88.36    86.12    87.40 (0.20)    89.59 (0.18)    88.86
References
 [1] Data dumps — Freebase API. https://developers.google.com/freebase/. Accessed December 20, 2018.
 [2] P. Battaglia, R. Pascanu, M. Lai, and D. J. Rezende. Interaction networks for learning about objects, relations and physics. In Neural Information Processing Systems, pages 4502–4510, 2016.
 [3] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pages 585–591, 2002.
 [4] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 [5] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.
 [6] Kenneth P Burnham and David R Anderson. Kullback-Leibler information as a basis for strong inference in ecological studies. Wildlife research, 28(2):111–119, 2001.
 [7] Hongyun Cai, Vincent W Zheng, and Kevin Chang. A comprehensive survey of graph embedding: problems, techniques and applications. IEEE Transactions on Knowledge and Data Engineering, 2018.
 [8] Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 891–900. ACM, 2015.
 [9] Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering, 2018.

 [10] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [11] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur. Protein interface prediction using graph convolutional networks. In Neural Information Processing Systems, pages 6530–6539, 2017.
 [12] Walter R Gilks, Sylvia Richardson, and David Spiegelhalter. Markov chain Monte Carlo in practice. Chapman and Hall/CRC, 1995.
 [13] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
 [14] T. Hamaguchi, H. Oiwa, M. Shimbo, and Y. Matsumoto. Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 1802–1808, 2017.
 [15] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1025–1035, 2017.
 [16] Xu Han, Shulin Cao, Lv Xin, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. Openke: An open toolkit for knowledge embedding. In Proceedings of EMNLP, 2018.

 [17] Matthew D Hoffman and Matthew J Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
 [18] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 687–696, 2015.
 [19] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.

 [20] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems 30, pages 6348–6358, 2017.
 [21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [22] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [23] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
 [24] Marius Kloft and Gilles Blanchard. The local Rademacher complexity of lp-norm multiple kernel learning. In Advances in Neural Information Processing Systems, pages 2438–2446, 2011.

 [25] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.
 [26] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2181–2187, 2015.
 [27] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, volume 15, pages 2181–2187, 2015.
 [28] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
 [29] Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
 [30] Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, and Mark Johnson. STransE: a novel embedding model of entities and relationships in knowledge bases. In HLT-NAACL, pages 460–466. The Association for Computational Linguistics, 2016.
 [31] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric transitivity preserving graph embedding. In KDD, 2016.
 [32] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. Unsupervised learning of sentence embeddings using compositional n-gram features. In NAACL 2018, Conference of the North American Chapter of the Association for Computational Linguistics, 2018.
 [33] John Paisley, David Blei, and Michael Jordan. Variational bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.
 [34] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
 [35] Meng Qu, Jian Tang, Jingbo Shang, Xiang Ren, Ming Zhang, and Jiawei Han. An attention-based collaboration framework for multi-view network representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1767–1776. ACM, 2017.
 [36] A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller, R. Hadsell, and P. Battaglia. Graph networks as learnable physics engines for inference and control. arXiv preprint arXiv:1806.01242, 2018.
 [37] Yu Shi, Fangqiu Han, Xinwei He, Xinran He, Carl Yang, Jie Luo, and Jiawei Han. mvn2vec: Preservation and collaboration in multi-view network embedding. arXiv preprint arXiv:1801.06597, 2018.

 [38] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pages 926–934, 2013.
 [39] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
 [40] Lei Tang and Huan Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 817–826. ACM, 2009.
 [41] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 1(2), 2017.

 [42] Z Wang, J Zhang, J Feng, and Z Chen. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 985–991, 2014.
 [43] Jiawei Wu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Knowledge representation via joint learning of sequential text and knowledge graphs. arXiv preprint arXiv:1609.07075, 2016.
 [44] Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. Representation learning of knowledge graphs with entity descriptions. In AAAI, pages 2659–2665, 2016.
 [45] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Chang. Network representation learning with rich text information. In TwentyFourth International Joint Conference on Artificial Intelligence, 2015.
 [46] Cheng Yang, Maosong Sun, Zhiyuan Liu, and Cunchao Tu. Fast network embedding enhancement via high order proximity approximation. In IJCAI, pages 3894–3900, 2017.
 [47] Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, and R Bharat Rao. Bayesian cotraining. Journal of Machine Learning Research, 12(Sep):2649–2680, 2011.
 [48] Deming Zhai, Hong Chang, Shiguang Shan, Xilin Chen, and Wen Gao. Multiview metric learning with global consistency and local smoothness. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):53, 2012.