End-to-end Structure-Aware Convolutional Networks for Knowledge Base Completion

11/11/2018 ∙ by Chao Shang, et al. ∙ JD.com, Inc. ∙ University of Connecticut

Knowledge graph embedding has been an active research topic for knowledge base completion, with progressive improvement from the initial TransE, TransH, DistMult, etc. to the current state-of-the-art ConvE. ConvE uses 2D convolution over embeddings and multiple layers of nonlinear features to model knowledge graphs. The model can be trained efficiently and scales to large knowledge graphs. However, there is no structure enforcement in the embedding space of ConvE. The recent graph convolutional network (GCN) provides another way of learning graph node embeddings by successfully utilizing graph connectivity structure. In this work, we propose a novel end-to-end Structure-Aware Convolutional Networks (SACN) that take the benefit of GCN and ConvE together. SACN consists of an encoder, a weighted graph convolutional network (WGCN), and a decoder, a convolutional network called Conv-TransE. WGCN utilizes knowledge graph node structure, node attributes and relation types. It has learnable weights that adapt the amount of information collected from neighboring graph nodes, resulting in more accurate embeddings of graph nodes. In addition, the node attributes are added as nodes in the graph and are easily integrated into the WGCN. The decoder Conv-TransE extends the state-of-the-art ConvE to be translational between entities and relations while keeping the state-of-the-art performance of ConvE. We demonstrate the effectiveness of our proposed SACN model on the standard FB15k-237 and WN18RR datasets, and present about 10% relative improvement over the state-of-the-art ConvE in terms of HITS@1, HITS@3 and HITS@10.

Introduction

In recent years, large-scale knowledge bases (KBs), such as Freebase [Bollacker et al.2008], DBpedia [Auer et al.2007], NELL [Carlson et al.2010] and YAGO3 [Mahdisoltani, Biega, and Suchanek2013], have been built to store structured information about common facts. KBs are multi-relational graphs whose nodes represent entities and whose edges represent relationships between entities; the edges are labeled with different relations. The relationships are organized in the form of triplets (e.g. entity = Abraham Lincoln, relation = DateOfBirth, entity = 02-12-1809). These KBs are extensively used for web search, recommendation and question answering. Although these KBs already contain millions of entities and triplets, they are far from complete compared to existing facts and newly added knowledge of the real world. Therefore knowledge base completion is important in order to predict new triplets based on existing ones and thus further expand KBs.

One of the recent active research areas for knowledge base completion is knowledge graph embedding: it encodes the semantics of entities and relations in a continuous low-dimensional vector space (the resulting vectors are called embeddings). These embeddings are then used for predicting new relations. Starting from a simple and effective approach called TransE [Bordes et al.2013], many knowledge graph embedding methods have been proposed, such as TransH [Wang et al.2014], TransR [Lin et al.2015], DistMult [Yang et al.2014], TransD [Ji et al.2015], ComplEx [Trouillon et al.2016], and STransE [Nguyen et al.2016]. Some surveys [Nguyen2017, Wang et al.2017] give details and comparisons of these embedding methods.

The most recent ConvE [Dettmers et al.2017] model uses 2D convolution over embeddings and multiple layers of nonlinear features, and achieves the state-of-the-art performance on common benchmark datasets for knowledge graph link prediction. In ConvE, the subject entity embedding $e_s$ and the relation embedding $e_r$ are reshaped and concatenated into an input matrix and fed to the convolution layer. 2D convolutional filters are then used to output feature maps that span different dimensions of the embedding entries. Thus ConvE does not keep the translational property of TransE, which is an additive embedding vector operation: $e_s + e_r \approx e_o$ [Nguyen et al.2017]. In this paper, we remove the reshape step of ConvE and apply convolutional filters directly in the same dimensions of $e_s$ and $e_r$. This modification gives better performance compared with the original ConvE, and has an intuitive interpretation: it keeps the global learning metric the same for $e_s$, $e_r$, and $e_o$ in an embedding triple $(e_s, e_r, e_o)$. We name this embedding Conv-TransE.

ConvE also does not incorporate connectivity structure in the knowledge graph into the embedding space. In contrast, graph convolutional network (GCN) has been an effective tool to create node embeddings which aggregate local information in the graph neighborhood for each node [Kipf and Welling2016b, Hamilton, Ying, and Leskovec2017a, Kipf and Welling2016a, Pham et al.2017, Shang et al.2018]. GCN models have additional benefits [Hamilton, Ying, and Leskovec2017b], such as leveraging the attributes associated with nodes. They can also impose the same aggregation scheme when computing the convolution for each node, which can be considered a method of regularization, and improves efficiency. Although scalability is originally an issue for GCN models, the latest data-efficient GCN, PinSage [Ying et al.2018], is able to handle billions of nodes and edges.

In this paper, we propose an end-to-end graph Structure-Aware Convolutional Networks (SACN) that take all benefits of GCN and ConvE together. SACN consists of an encoder, a weighted graph convolutional network (WGCN), and a decoder, a convolutional network called Conv-TransE. WGCN utilizes knowledge graph node structure, node attributes and relation types. It has learnable weights to determine the amount of information from neighbors used in local aggregation, leading to more accurate embeddings of graph nodes. Node attributes are added to WGCN as additional nodes for easy integration. The output of WGCN becomes the input of the decoder Conv-TransE. Conv-TransE is similar to ConvE but with the difference that Conv-TransE keeps the translational characteristic between entities and relations. We show that Conv-TransE performs better than ConvE, and our SACN improves further on top of Conv-TransE on the standard benchmark datasets. The code for our model and experiments is publicly available at https://github.com/JD-AI-Research-Silicon-Valley/SACN.

Our contributions are summarized as follows:

  • We present an end-to-end network learning framework SACN that takes benefit of both GCN and Conv-TransE. The encoder GCN model leverages graph structure and attributes of graph nodes. The decoder Conv-TransE simplifies ConvE with special convolutions and keeps the translational property of TransE and the prediction performance of ConvE;

  • We demonstrate the effectiveness of our proposed SACN on the standard FB15k-237 and WN18RR datasets, and show about 10% relative improvement over the state-of-the-art ConvE in terms of HITS@1, HITS@3 and HITS@10.

Related Work

Knowledge graph embedding learning has been an active research area with direct applications in knowledge base completion (i.e. link prediction) and relation extraction. TransE [Bordes et al.2013] started this line of work by projecting both entities and relations into the same embedding vector space, with the translational constraint $e_s + e_r \approx e_o$. Later works such as TransH [Wang et al.2014], TransR [Lin et al.2015], and TransD [Ji et al.2015] enhanced KG embedding models by introducing new representations of relational translation, and thus increased model complexity. These models were categorized as translational distance models [Wang et al.2017] or additive models, while DistMult [Yang et al.2014] and ComplEx [Trouillon et al.2016] are multiplicative models [Sharma, Talukdar, and others2018], due to the multiplicative score functions used for computing entity-relation-entity triplet likelihood.
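
To make the additive versus multiplicative distinction concrete, the following is a minimal sketch of a TransE-style distance and a DistMult-style trilinear product on toy vectors. The function names and the numbers are illustrative assumptions and are not taken from any of the cited implementations.

```python
import math

def transe_score(e_s, e_r, e_o, p=2):
    """Additive model: distance ||e_s + e_r - e_o||_p (smaller = more plausible)."""
    diffs = [s + r - o for s, r, o in zip(e_s, e_r, e_o)]
    return sum(abs(d) ** p for d in diffs) ** (1.0 / p)

def distmult_score(e_s, e_r, e_o):
    """Multiplicative model: trilinear product <e_s, e_r, e_o> (larger = more plausible)."""
    return sum(s * r * o for s, r, o in zip(e_s, e_r, e_o))

# Toy 3-dimensional embeddings (made-up values, for illustration only).
e_s = [0.5, 0.0, 1.0]
e_r = [0.5, 1.0, -1.0]
e_o_good = [1.0, 1.0, 0.0]   # exactly e_s + e_r, so the translation holds
e_o_bad = [0.0, 0.0, 1.0]

print(transe_score(e_s, e_r, e_o_good))   # 0.0
print(transe_score(e_s, e_r, e_o_bad))    # strictly positive distance
print(distmult_score(e_s, e_r, e_o_good))
```

In the additive case a perfect triple has distance zero, which is the translational property that Conv-TransE is designed to retain.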

The most recent KG embedding models are ConvE [Dettmers et al.2017] and ConvKB [Nguyen et al.2017]. ConvE was the first model using 2D convolutions over embeddings of different embedding dimensions, with the hope of extracting more feature interactions. ConvKB replaced the 2D convolutions in ConvE with 1D convolutions, which constrains the convolutions to the same embedding dimensions and keeps the translational property of TransE. ConvKB can be considered a special case of Conv-TransE that only uses filters with width equal to 1. Although ConvKB was shown to be better than ConvE, the results on two datasets (FB15k-237 and WN18RR) were not consistent, so we leave these results out of our comparison table. The other major difference between ConvE and ConvKB lies in the loss functions used in the models. ConvE used the cross-entropy loss, which could be sped up with 1-N scoring in the decoder, while ConvKB used a hinge loss computed from positive examples and sampled negative examples. We take the decoder from ConvE because we can easily integrate the GCN encoder and the ConvE decoder into an end-to-end training framework, while ConvKB is not suitable for our approach.

These embedding models achieved good performance for knowledge base completion in terms of efficiency and scalability. However, these approaches only modeled relational triplets, while ignoring a large number of attributes associated with graph nodes, e.g., ages of people or release region of music. Furthermore, these models do not enforce any large-scale connectivity structure in the embedding space, and totally ignore the knowledge graph structure. The proposed SACN handles these two problems in an end-to-end training framework, by using a variant of graph convolutional network (GCN) as the encoder, and a variant of ConvE as the decoder.

Figure 1: An illustration of our end-to-end Structure-Aware Convolutional Networks model. For the encoder, a stack of multiple WGCN layers builds an entity/node embedding matrix. For the decoder, $e_s$ and $e_r$ are fed into Conv-TransE. The output embeddings are vectorized and projected, and matched with all candidate $e_o$ embeddings via inner products. A logistic sigmoid function is used to get the scores.

GCNs were first proposed in [Bruna et al.2013], where graph convolutional operations were defined in the Fourier domain. The eigendecomposition of the graph Laplacian made this approach computationally intensive. Later, smooth parametric spectral filters [Henaff, Bruna, and LeCun2015, Defferrard, Bresson, and Vandergheynst2016] were introduced to achieve localization in the spatial domain and improve computational efficiency. Recently, Kipf and Welling [Kipf and Welling2016b] simplified these spectral methods by a first-order approximation with the Chebyshev polynomials. The spatial graph convolution approaches [Hamilton, Ying, and Leskovec2017a] define convolutions directly on the graph, summing up node features over all spatial neighbors using the adjacency matrix.

GCN models were mostly criticized for their huge memory requirements when scaling to massive graphs. However, [Ying et al.2018] developed a data-efficient GCN algorithm called PinSage, which combined efficient random walks and graph convolutions to generate embeddings of nodes that incorporated both graph structure and node features. The experiments on Pinterest data were the largest application of deep graph embeddings to date, with 3 billion nodes and 18 billion edges [Ying et al.2018]. This success paves the way for a new generation of web-scale recommender systems based on GCNs. Therefore we believe that our proposed model could take advantage of huge graph structures and the high computational efficiency of Conv-TransE.

Method

In this section, we describe the proposed end-to-end SACN. The encoder WGCN is focused on representing entities by aggregating connected entities as specified by the relations in the KB. With node embeddings as the input, the decoder Conv-TransE network aims to represent the relations more accurately by recovering the original triplets in the KB. Both encoder and decoder are trained jointly by minimizing the discrepancy (cross-entropy) between the embeddings $e_s + e_r$ and $e_o$ to preserve the translational property $e_s + e_r \approx e_o$. We consider an undirected graph $G = (V, E)$ throughout this section, where $V$ is a set of nodes with $|V| = N$, and $E \subseteq V \times V$ is a set of edges with $|E| = M$.

Weighted Graph Convolutional Layer

The WGCN is an extension of the classic GCN [Kipf and Welling2016b] in that it weighs the different types of relations differently when aggregating, and the weights are adaptively learned during the training of the network. By this adaptation, the WGCN can control the amount of information from neighboring nodes used in aggregation. Roughly speaking, the WGCN treats a multi-relational KB graph as multiple single-relational subgraphs, where each subgraph contains a specific type of relation. The WGCN determines how much weight to give to each subgraph when combining the GCN embeddings for a node.

The $l$-th WGCN layer takes the output vector of length $F^l$ for each node from the previous layer as input and generates a new representation comprising $F^{l+1}$ elements. Let $h_i^l$ represent the input (row) vector of the node $v_i$ in the $l$-th layer, and thus $H^l \in \mathbb{R}^{N \times F^l}$ be the input matrix for this layer. The initial embedding $H^1$ is randomly drawn from a Gaussian distribution. If there are a total of $L$ layers in the WGCN, the output $H^{L+1}$ of the $L$-th layer is the final embedding. Let the total number of edge types be $T$ in a multi-relational KB graph with edge set $E$. The interaction strength between two adjacent nodes is determined by their relation type, and this strength is specified by a parameter $\{\alpha_t, 1 \le t \le T\}$ for each edge type, which is automatically learned in the neural network.

Figure 1 illustrates the entire process of SACN. In this example, the WGCN layers of the network compute the embeddings for the red node in the middle graph. These layers aggregate the embeddings of neighboring entity nodes as specified in the KB relations. Three colors (blue, yellow and green) of the edges indicate three different relation types in the graph. The corresponding three entity nodes are summed up with different weights according to $\alpha_t$ in this layer to obtain the embedding of the red node. The edges with the same color (same relation type) use the same $\alpha_t$. Each layer has its own set of relation weights $\alpha_t^l$. Hence, the output of the $l$-th layer for the node $v_i$ can be written as follows:

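$$h_i^{l+1} = \sigma\Big(\sum_{j \in N_i} \alpha_t^l \, g(h_i^l, h_j^l)\Big) \tag{1}$$

where $h_j^l \in \mathbb{R}^{F^l}$ is the input for node $v_j$, and $v_j$ is a node in the neighborhood $N_i$ of node $v_i$. The $g$ function specifies how to incorporate neighboring information. Note that the activation function $\sigma$ here is applied to every component of its input vector. Although any function $g$ suitable for a KB embedding can be used in conjunction with the proposed framework, we implement the following $g$ function:

$$g(h_i^l, h_j^l) = h_j^l W^l \tag{2}$$

where $W^l \in \mathbb{R}^{F^l \times F^{l+1}}$ is the connection coefficient matrix, used to linearly transform $h_i^l$ to $h_i^{l+1} \in \mathbb{R}^{F^{l+1}}$.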

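In Eq. (1), the input vectors of all neighboring nodes are summed up but not the node $v_i$ itself, hence self-loops are enforced in the network. For node $v_i$, the propagation process is defined as:

$$h_i^{l+1} = \sigma\Big(\sum_{j \in N_i} \alpha_t^l \, h_j^l W^l + h_i^l W^l\Big) \tag{3}$$

The output of layer $l$ is a node feature matrix $H^{l+1} \in \mathbb{R}^{N \times F^{l+1}}$, and $h_i^{l+1}$ is the $i$-th row of $H^{l+1}$, which represents the features of the node $v_i$ in the $(l+1)$-th layer.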
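
The per-node propagation with a self-loop can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names, the `(neighbor, relation_type)` adjacency-list layout, and the choice of `tanh` as the activation are ours and are not taken from the released code.

```python
import math

def vec_mat(h, W):
    """Row vector (length F_in) times matrix (F_in x F_out)."""
    return [sum(h[k] * W[k][j] for k in range(len(h))) for j in range(len(W[0]))]

def wgcn_node_update(i, H, neighbors, alpha, W, act=math.tanh):
    """h_i' = act( sum_j alpha[t_ij] * (h_j W) + h_i W ).

    neighbors[i] is a list of (j, t) pairs: neighbor index and relation type.
    alpha[t] is the (learnable) weight of relation type t in this layer.
    """
    total = vec_mat(H[i], W)                  # self-loop term h_i W
    for j, t in neighbors[i]:
        msg = vec_mat(H[j], W)                # g(h_i, h_j) = h_j W
        total = [a + alpha[t] * m for a, m in zip(total, msg)]
    return [act(x) for x in total]

# Toy graph: node 0 has neighbor 1 via relation type 0 and neighbor 2 via type 1.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [(1, 0), (2, 1)]}
alpha = [0.5, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]                  # identity, to keep the arithmetic visible

print(wgcn_node_update(0, H, neighbors, alpha, W))
```

With the identity activation the pre-activation value for node 0 is `[1, 0] + 0.5 * [0, 1] + 2.0 * [1, 1] = [3.0, 2.5]`, which shows how the relation weights scale each neighbor's contribution.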

The above process can be organized as a matrix multiplication, as shown in Figure 2, to simultaneously compute embeddings for all nodes through an adjacency matrix. For each relation (edge) type, an adjacency matrix $A_t$ is a binary matrix whose $ij$-th entry is 1 if an edge connecting $v_i$ and $v_j$ exists, and 0 otherwise. The final adjacency matrix is written as follows:

$$A^l = \sum_{t=1}^{T} \big(\alpha_t^l A_t\big) + I \tag{4}$$

where $I$ is the identity matrix of size $N \times N$. In other words, $A^l$ is the weighted sum of the adjacency matrices of the subgraphs plus self-connections. In our implementation, we consider all first-order neighbors in the linear transformation for each layer, as shown in Figure 2:

$$H^{l+1} = \sigma\big(A^l H^l W^l\big) \tag{5}$$

Figure 2: A weighted graph convolutional network (WGCN) for entity embedding.
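
The matrix form of one WGCN layer (a weighted sum of per-relation adjacency matrices plus the identity, followed by a linear map and an activation) can be sketched as follows. The function names, the dense list-of-lists matrices, and the toy values are assumptions made for illustration; a practical implementation would use sparse tensors.

```python
def weighted_adjacency(adj_by_type, alpha, n):
    """A = sum_t alpha[t] * A_t + I, with each A_t a binary n x n matrix."""
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for A_t, a_t in zip(adj_by_type, alpha):
        for i in range(n):
            for j in range(n):
                A[i][j] += a_t * A_t[i][j]
    return A

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def wgcn_layer(adj_by_type, alpha, H, W, act):
    """H_next = act(A H W), applied element-wise."""
    A = weighted_adjacency(adj_by_type, alpha, len(H))
    Z = matmul(matmul(A, H), W)
    return [[act(z) for z in row] for row in Z]

# Toy undirected graph with 3 nodes and 2 relation types.
A0 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]   # type 0: edge between nodes 0 and 1
A1 = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]   # type 1: edge between nodes 0 and 2
alpha = [0.5, 2.0]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]

H_next = wgcn_layer([A0, A1], alpha, H, W, act=lambda x: x)
print(H_next)  # [[3.0, 2.5], [0.5, 1.0], [3.0, 1.0]]
```

Row 0 of the result equals the per-node sum over neighbors plus the self-loop, so the matrix form computes all node updates of a layer at once.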

Node Attributes.

In a KB graph, nodes are often associated with several attributes in the form of (entity, relation, attribute). For example, (Tom, people.person.gender, male) is an instance where gender is an attribute associated with a person. If a vector representation were used for node attributes, there would be two potential problems. First, the number of attributes for each node is usually small, and differs from one node to another. Hence, the attribute vector would be very sparse. Second, a value of zero in the attribute vector may have ambiguous meanings: either the node does not have the specific attribute, or the value of this attribute is missing for the node. These zeros would affect the accuracy of the embedding.

In this work, the entity attributes in the knowledge graph are represented by another set of nodes in the network called attribute nodes. Attribute nodes act as the “bridges” that link related entities. The entity embeddings can be transported over these “bridges” to incorporate the entity’s attributes into its embedding. Because these attributes appear in triplets, we represent the attributes similarly to the representation of the entity in relation triplets. Note that each type of attribute corresponds to a node. For instance, in our example, gender is represented by a single node rather than two nodes for “male” and “female”. In this way, the WGCN not only utilizes the graph connectivity structure (relations and relation types), but also leverages the node attributes (a kind of graph structure) effectively. That is why we name our WGCN a structure-aware convolutional network.
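
As a rough sketch of this idea, the snippet below adds one shared node per attribute type and links every entity that has this attribute to it. The function name, the four-field `attribute_facts` layout, the `"attr:"` prefix and the toy relation `friend_of` are hypothetical choices for illustration, not the format used in FB24k or in the released code.

```python
def add_attribute_nodes(relation_triples, attribute_facts):
    """Return (node_index, edges) where each attribute *type* is a single node.

    relation_triples: (subject, relation, object) entity triples.
    attribute_facts:  (entity, relation, attribute_type, value); the value is
                      dropped, so "male" and "female" share the "gender" node.
    """
    node_index = {}

    def idx(name):
        if name not in node_index:
            node_index[name] = len(node_index)
        return node_index[name]

    edges = [(idx(s), r, idx(o)) for s, r, o in relation_triples]
    for entity, relation, attr_type, _value in attribute_facts:
        edges.append((idx(entity), relation, idx("attr:" + attr_type)))
    return node_index, edges

relation_triples = [("Tom", "friend_of", "Ann")]
attribute_facts = [
    ("Tom", "people.person.gender", "gender", "male"),
    ("Ann", "people.person.gender", "gender", "female"),
]
node_index, edges = add_attribute_nodes(relation_triples, attribute_facts)
print(node_index)  # {'Tom': 0, 'Ann': 1, 'attr:gender': 2}
print(edges)
```

Both entities end up connected to the same attribute node, which is the "bridge" over which a graph convolution can pass information between entities sharing an attribute.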

Conv-TransE

We develop the Conv-TransE model as a decoder that is based on ConvE but with the translational property of TransE: $e_s + e_r \approx e_o$. The key difference of our approach from ConvE is that there is no reshaping after stacking $e_s$ and $e_r$. Filters (or kernels) of size $2 \times k$, $k \in \{1, 2, 3, \ldots\}$, are used in the convolution. The example in Figure 1 uses $2 \times 3$ kernels to compute 2D convolutions. We experimented with several such settings in our empirical study.

Note that in the encoder of SACN, the dimension of the relation embedding is commonly chosen to be the same as the dimension of the entity embedding; in other words, it is equal to $F^L$. Hence, the two embeddings can be stacked. For the decoder, the inputs are two embedding matrices: one from WGCN for all entity nodes, and the other a relation embedding matrix, which is trained as well. Because we use a mini-batch stochastic training algorithm, the first step of the decoder performs a look-up operation on the embedding matrices to retrieve the inputs $e_s$ and $e_r$ for the triplets in the mini-batch.

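More precisely, given $C$ different kernels where the $c$-th kernel is parameterized by $\omega_c$, the convolution in the decoder is computed as follows:

$$m_c(e_s, e_r, n) = \sum_{\tau=0}^{K-1} \omega_c(\tau, 0)\, \hat{e}_s(n + \tau) + \omega_c(\tau, 1)\, \hat{e}_r(n + \tau) \tag{6}$$

where $K$ is the kernel width, $n$ indexes the entries in the output vector with $n \in [0, F^L - 1]$, and the kernel parameters $\omega_c$ are trainable. $\hat{e}_s$ and $\hat{e}_r$ are padded versions of $e_s$ and $e_r$ respectively. If the kernel width $K$ is odd, the first $\lfloor K/2 \rfloor$ and last $\lfloor K/2 \rfloor$ components are filled with 0, where $\lfloor x \rfloor$ returns the floor of $x$. Otherwise, the first $\lfloor K/2 \rfloor - 1$ and last $\lfloor K/2 \rfloor$ components are filled with 0. Other components are copied from $e_s$ and $e_r$ directly. As shown in Eq. (6), the convolution operation amounts to a sum of $e_s$ and $e_r$ after the one-dimensional convolution. Hence, it preserves the translational property of the embeddings $e_s$ and $e_r$. The output forms a vector $M_c(e_s, e_r) = [m_c(e_s, e_r, 0), \ldots, m_c(e_s, e_r, F^L - 1)]$. Aligning the output vectors from the convolution with all kernels yields a matrix $M(e_s, e_r) \in \mathbb{R}^{C \times F^L}$.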
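
The padded convolution of a single kernel over the stacked entity and relation embeddings can be sketched in plain Python as below. The function names and the representation of a 2 x K kernel as a list of K `(weight_for_e_s, weight_for_e_r)` pairs are illustrative assumptions; the padding follows the odd/even rule described in the text.

```python
def pad(e, K):
    """Zero-pad e so the convolution output keeps the length of e.

    Odd K:  floor(K/2) zeros in front and floor(K/2) behind.
    Even K: floor(K/2) - 1 zeros in front and floor(K/2) behind.
    """
    half = K // 2
    front = half if K % 2 == 1 else half - 1
    return [0.0] * front + [float(x) for x in e] + [0.0] * half

def conv_transe_kernel(e_s, e_r, omega):
    """m(n) = sum_tau omega[tau][0] * pad(e_s)[n + tau] + omega[tau][1] * pad(e_r)[n + tau]."""
    K = len(omega)
    ps, pr = pad(e_s, K), pad(e_r, K)
    return [
        sum(omega[t][0] * ps[n + t] + omega[t][1] * pr[n + t] for t in range(K))
        for n in range(len(e_s))
    ]

e_s = [1.0, 2.0, 3.0]
e_r = [0.5, 0.5, 0.5]

# A 2 x 1 kernel with unit weights reduces to the TransE translation e_s + e_r.
print(conv_transe_kernel(e_s, e_r, [(1.0, 1.0)]))  # [1.5, 2.5, 3.5]

# A 2 x 3 kernel mixes each entry with its left and right neighbors.
print(conv_transe_kernel(e_s, e_r, [(0.25, 0.25), (0.5, 0.5), (0.25, 0.25)]))
```

Because the same output position `n` always combines entries of `e_s` and `e_r` at the same offsets, the kernel output is a sum of a 1D convolution of `e_s` and a 1D convolution of `e_r`, which is the sense in which the translation between entity and relation is kept.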

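Model      Scoring function $\psi(e_s, e_o)$
TransE     $\lVert e_s + e_r - e_o \rVert_p$
DistMult   $\langle e_s, e_r, e_o \rangle$
ComplEx    $\mathrm{Re}(\langle e_s, e_r, \mathrm{conj}(e_o) \rangle)$, with complex-valued embeddings
ConvE      $f(\mathrm{vec}(f(\mathrm{concat}(\bar{e}_s, \bar{e}_r) * \omega)) W)\, e_o$
ConvKB     $\mathrm{concat}(g([e_s, e_r, e_o] * \omega))\, w$
SACN       $f(\mathrm{vec}(M(e_s, e_r)) W)\, e_o$
Table 1: Scoring function $\psi(e_s, e_o)$. Here $\bar{e}_s$ and $\bar{e}_r$ denote a 2D reshaping of $e_s$ and $e_r$.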

Finally, the scoring function for the Conv-TransE method after the nonlinear convolution is defined as below:

$$\psi(e_s, e_o) = f\big(\mathrm{vec}(M(e_s, e_r))\, W\big)\, e_o \tag{7}$$

where $W \in \mathbb{R}^{C F^L \times F^L}$ is a matrix for the linear transformation, and $f$ denotes a non-linear function. The feature map matrix is reshaped into a vector $\mathrm{vec}(M) \in \mathbb{R}^{C F^L}$ and projected into an $F^L$-dimensional space using $W$ for the linear transformation. Then the calculated embedding is matched to $e_o$ by an appropriate distance metric. During training in our experiments, we apply the logistic sigmoid function to the scoring:

$$p(e_s, e_r, e_o) = \sigma\big(\psi(e_s, e_o)\big) \tag{8}$$

In Table 1, we summarize the scoring functions used by several state-of-the-art models. The vectors $e_s$ and $e_o$ are the subject and object embeddings respectively, $e_r$ is the relation embedding, “concat” means concatenating the inputs, and “*” denotes the convolution operator.
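
The scoring step (flatten the feature maps, project them back to the embedding dimension, apply a nonlinearity, take the inner product with the candidate object embedding, then squash with a sigmoid) can be sketched as follows. The function names and toy numbers are assumptions, and ReLU is used for the non-linear function only as an example, since the text does not fix a specific choice here.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sacn_score(M, W, e_o, f=relu):
    """psi = f(vec(M) W) . e_o, with M a C x F feature map and W a (C*F) x F matrix."""
    v = [x for row in M for x in row]                       # vec(M), length C*F
    proj = [f(sum(v[k] * W[k][j] for k in range(len(v))))   # project to F dimensions
            for j in range(len(e_o))]
    return sum(p * o for p, o in zip(proj, e_o))            # inner product with e_o

# Toy setting: C = 2 kernels, embedding size F = 2.
M = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
e_o = [0.5, -0.5]

psi = sacn_score(M, W, e_o)
print(psi)           # -1.0
print(sigmoid(psi))  # probability-like score in (0, 1)
```

In 1-N scoring the same projected vector is matched against every candidate object embedding at once, so the projection is computed only once per (subject, relation) pair.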

In summary, the proposed SACN model takes advantage of knowledge graph node connectivity, node attributes and relation types. The learnable weights in WGCN help to collect adaptive amount of information from neighboring graph nodes. The entity attributes are added as additional nodes in the network and are easily integrated into the WGCN. Conv-TransE keeps the translational property between entities and relations to learn node embeddings for the link prediction. We also emphasize that our SACN has significant improvements over ConvE with or without the use of node attributes.

Experiments


Dataset             FB15k-237   WN18RR   FB15k-237-Attr
Entities            14,541      40,943   14,744
Relations           237         11       484
Train Edges         272,115     86,835   350,449
Val. Edges          17,535      3,034    17,535
Test Edges          20,466      3,134    20,466
Attribute Triples   -           -        78,334
Attributes          -           -        203
Table 2: Statistics of datasets.

Benchmark Datasets

Three benchmark datasets (FB15k-237, WN18RR and FB15k-237-Attr) are utilized in this study to evaluate the performance of link prediction.

FB15k-237. The FB15k-237 [Toutanova and Chen2015] dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs, as used in the work published in [Toutanova and Chen2015]. The knowledge base triples are a subset of the FB15K [Bordes et al.2013], originally derived from Freebase. The inverse relations are removed in FB15k-237.
WN18RR. WN18RR [Dettmers et al.2017] is created from WN18 [Bordes et al.2013], which is a subset of WordNet. WN18 consists of 18 relations and 40,943 entities. However, many test triples are obtained by inverting triples from the training set. Thus the WN18RR dataset [Dettmers et al.2017] was created to ensure that the evaluation dataset does not have inverse relation test leakage. In summary, the WN18RR dataset contains 93,003 triples with 40,943 entities and 11 relation types.


                                   FB15k-237                           WN18RR
Model                              Hits@10   Hits@3   Hits@1   MRR     Hits@10   Hits@3   Hits@1   MRR
DistMult [Yang et al.2014]         0.42      0.26     0.16     0.24    0.49      0.44     0.39     0.43
ComplEx [Trouillon et al.2016]     0.43      0.28     0.16     0.25    0.51      0.46     0.41     0.44
R-GCN [Schlichtkrull et al.2018]   0.42      0.26     0.15     0.25    -         -        -        -
ConvE [Dettmers et al.2017]        0.49      0.35     0.24     0.32    0.48      0.43     0.39     0.46
Conv-TransE                        0.51      0.37     0.24     0.33    0.52      0.47     0.43     0.46
SACN                               0.54      0.39     0.26     0.35    0.54      0.48     0.43     0.47
SACN using FB15k-237-Attr          0.55      0.40     0.27     0.36    -         -        -        -
Performance Improvement            12.2%     14.3%    12.5%    12.5%   12.5%     11.6%    10.3%    2.2%
Table 3: Link prediction for FB15k-237, WN18RR and FB15k-237-Attr datasets.

Data Construction

Most of the previous methods only model the entities and relations, and ignore the abundant entity attributes. Our method can easily model a large number of entity attribute triples. To demonstrate its effectiveness, we extract the attribute triples from the FB24k [Lin, Liu, and Sun2016] dataset to build the evaluation dataset called FB15k-237-Attr.

FB24k. FB24k [Lin, Liu, and Sun2016] is built based on Freebase dataset. FB24k only selects the entities and relations which constitute at least 30 triples. The number of entities is 23,634, and the number of relations is 673. In addition, the reversed relations are removed from the original dataset. In the FB24k datasets, the attribute triples are provided. FB24k contains 207,151 attribute triples and 314 attributes.
FB15k-237-Attr. We extract the attribute triples of entities in FB15k-237 from FB24k. During the mapping, there are 7,589 nodes from the original 14,541 entities which have the node attributes. Finally, we extract 78,334 attribute triples from FB24k. These triples include 203 attributes and 247 relations. Based on these triples, we create the “FB15k-237-Attr” dataset, which includes 14,541 entity nodes, 203 attribute nodes, 484 relation types. All the 78,334 attribute triples are combined with the training set of FB15k-237.

Experimental Setup

The hyperparameters in our Conv-TransE and SACN models are determined by a grid search during training. We manually specify the search ranges for the learning rate, dropout rate, embedding size, number of kernels, and kernel size.

Here all the models use the WGCN with two layers. For different datasets, we have found that the following settings work well: for FB15k-237, set the dropout to 0.2, number of kernels to 100, learning rate to 0.003 and embedding size to 200 for SACN; for WN18RR dataset, set dropout to 0.2, number of kernels to 300, learning rate to 0.003, and embedding size to 200 for SACN. When using the Conv-TransE-alone model, these settings still work well.

Each dataset is split into three sets for training, validation and testing, following the same setting as the original ConvE. We use the adaptive moment (Adam) algorithm [Kingma and Ba2014] for training the model. Our models are implemented in PyTorch and run on NVIDIA Tesla P40 Graphics Processing Units. For the FB15k-237 dataset, the computation time of SACN for each epoch is about 1 minute. For the WN18RR dataset, the computation time of SACN for one epoch is about 1.5 minutes.

Results

Evaluation Protocol

Our experiments use the proportion of correct entities ranked in the top 1, 3 and 10 (Hits@1, Hits@3, Hits@10) and the mean reciprocal rank (MRR) as the metrics. In addition, since some corrupted triples may themselves exist in the knowledge graphs as valid triples, we use the filtered setting [Bordes et al.2013], i.e. we filter out all other valid triples before ranking.
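
A minimal sketch of these metrics under the filtered setting is given below. The function names, the dictionary-of-scores input and the toy values are assumptions made for illustration; they are not taken from the evaluation code of ConvE or SACN.

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` after removing other candidates that are known to be valid.

    scores:     dict mapping candidate entity -> score (higher is better).
    known_true: set of candidates that form valid triples in train/valid/test.
    """
    t = scores[target]
    better = sum(1 for c, s in scores.items()
                 if c != target and c not in known_true and s > t)
    return better + 1

def hits_at_k(ranks, k):
    """Fraction of test cases whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Candidate "a" scores higher than the target "c" but is itself a valid answer,
# so it is filtered out and does not push the target down.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
print(filtered_rank(scores, "c", known_true={"a"}))  # 2 (the raw rank would be 3)

ranks = [1, 2, 4]
print(hits_at_k(ranks, 1), hits_at_k(ranks, 3), mrr(ranks))
```

Filtering matters mostly for entities with many valid answers, where the raw rank would penalize the model for ranking other correct entities highly.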

Link Prediction

Our results on the standard FB15k-237, WN18RR and FB15k-237-Attr datasets are shown in Table 3. Table 3 reports Hits@10, Hits@3, Hits@1 and MRR results of four different baseline models and our two models on three knowledge graph datasets. The FB15k-237-Attr dataset is used to demonstrate the effectiveness of node attributes, so we run our SACN on FB15k-237-Attr and compare it with SACN using FB15k-237.

We first compare our Conv-TransE model with the four baseline models. ConvE has the best performance among all baselines. On the FB15k-237 dataset, our Conv-TransE model improves upon ConvE’s Hits@10 by a relative margin of 4.1%, and upon ConvE’s Hits@3 by a margin of 5.7% on the test set. On the WN18RR dataset, Conv-TransE improves upon ConvE’s Hits@10 by a margin of 8.3%, and upon ConvE’s Hits@3 by a margin of 9.3% on the test set. From these results, we conclude that Conv-TransE, using a neural network, keeps the translational characteristic between entities and relations and achieves better performance.

Second, the structure information is added into our SACN model. In Table 3, SACN also achieves the best performance on the test set among all baseline methods. On FB15k-237, compared with ConvE, our SACN model improves Hits@10 by a margin of 10.2%, Hits@3 by a margin of 11.4%, Hits@1 by a margin of 8.3% and MRR by a margin of 9.4% on the test set. On the WN18RR dataset, compared with ConvE, our SACN model improves Hits@10 by a margin of 12.5%, Hits@3 by a margin of 11.6%, Hits@1 by a margin of 10.3% and MRR by a margin of 2.2% on the test set. So our method has significant improvements over ConvE without attributes.

Third, we add node attributes into our SACN model, i.e. we use FB15k-237-Attr to train SACN. Note that SACN has significant improvements over ConvE even without attributes. Adding attributes improves performance again. Our model using attributes improves upon ConvE’s Hits@10 by a margin of 12.2%, Hits@3 by a margin of 14.3%, Hits@1 by a margin of 12.5% and MRR by a margin of 12.5%. In addition, our SACN using attributes improves Hits@10 by a margin of 1.9%, Hits@3 by a margin of 2.6%, Hits@1 by a margin of 3.8% and MRR by a margin of 2.9% compared with SACN without attributes.
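
The margins quoted in this section are relative improvements over the baseline score. As a quick check, the short sketch below recomputes the FB15k-237 margins of SACN over ConvE from the rounded values in Table 3; the helper name is ours.

```python
def relative_improvement(new, baseline):
    """Relative improvement in percent: 100 * (new - baseline) / baseline."""
    return 100.0 * (new - baseline) / baseline

# FB15k-237 test results from Table 3.
conve = {"Hits@10": 0.49, "Hits@3": 0.35, "Hits@1": 0.24, "MRR": 0.32}
sacn = {"Hits@10": 0.54, "Hits@3": 0.39, "Hits@1": 0.26, "MRR": 0.35}

for metric in conve:
    print(metric, round(relative_improvement(sacn[metric], conve[metric]), 1))
# Hits@10 10.2
# Hits@3 11.4
# Hits@1 8.3
# MRR 9.4
```

For example, for Hits@10 the computation is (0.54 - 0.49) / 0.49, which is about 10.2%.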

For a better comparison with ConvE, we also feed the attributes into ConvE. Here the attributes are treated as entity triplets. Following the official ConvE code with default settings, the test results on FB15k-237-Attr were 0.46 (Hits@10), 0.33 (Hits@3), 0.22 (Hits@1) and 0.30 (MRR). Compared to the performance without the attributes, adding the attributes into ConvE did not improve performance.

Figure 3: The convergence study of SACN, Conv-TransE models in FB15k-237 and SACN in FB15k-237-Attr (SACN + Attr) using the validation set. Due to the page limitation, only the results of Hits@1 and MRR are reported here.

FB15k-237
Model         Kernel Size   Hits@10   Hits@3   Hits@1   MRR
Conv-TransE   2×1           0.504     0.357    0.234    0.324
Conv-TransE   2×3           0.513     0.365    0.240    0.331
Conv-TransE   2×5           0.512     0.361    0.239    0.329
SACN          2×1           0.527     0.379    0.255    0.345
SACN          2×3           0.536     0.384    0.260    0.351
SACN          2×5           0.536     0.385    0.261    0.352
SACN+Attr     2×1           0.535     0.384    0.260    0.351
SACN+Attr     2×3           0.543     0.394    0.268    0.360
SACN+Attr     2×5           0.547     0.396    0.268    0.360
Table 4: Kernel size analysis for FB15k-237 and FB15k-237-Attr datasets. “SACN+Attr” means the SACN using FB15k-237-Attr dataset.

Convergence Analysis

Figure 3 shows the convergence of the three models. We can see that SACN (the red line) is always better than Conv-TransE (the yellow line) after several epochs. The performance of SACN keeps increasing after around 120 epochs, whereas Conv-TransE has already reached its best performance after around 120 epochs. The gap between these two models demonstrates the usefulness of structural information. When using the FB15k-237-Attr dataset, the performance of “SACN + Attr” is better than that of the “SACN” model.

Kernel Size Analysis

In Table 4, different kernel sizes are examined in our models. The kernel of “2 × 1” means that knowledge or information is translated between one attribute of the entity vector and the corresponding attribute of the relation vector. If we increase the kernel size to “2 × k” where $k \in \{3, 5\}$, the information is translated between a combination of attributes in the entity vector and a combination of attributes in the relation vector. The larger view for collecting attribute information can help to increase the performance, as shown in Table 4. The values of Hits@1, Hits@3, Hits@10 and MRR can be improved by increasing the kernel size on the FB15k-237 and FB15k-237-Attr datasets. However, the optimal kernel size may be task dependent.


                    Conv-TransE (Average Hits)   SACN (Average Hits)
Indegree Scope      @10       @3                 @10       @3
[0,100]             0.192     0.125              0.195     0.134
[100,200]           0.441     0.245              0.441     0.253
[200,300]           0.696     0.446              0.705     0.429
[300,400]           0.829     0.558              0.806     0.577
[400,500]           0.894     0.661              0.868     0.663
[500,1000]          0.918     0.767              0.891     0.695
[1000, maximum]     0.992     0.941              0.981     0.922
Table 5: Node indegree study using FB15k-237 dataset.

Node Indegree Analysis

The indegree of a node in the knowledge graph is the number of edges connected to the node. A node with larger indegree has more neighboring nodes, and can receive more information from them than nodes with smaller indegree. As shown in Table 5, we present the results for different sets of nodes with different indegree scopes. The average Hits@10 and Hits@3 scores are calculated. As the indegree scope increases, the average values of Hits@10 and Hits@3 increase. First, a node with small indegree benefits from the aggregation of neighbor information in the WGCN layers of SACN, so its embedding can be estimated robustly. Second, for a node with high indegree, a lot more information is aggregated through GCN, and the estimate of its embedding is substantially smoothed among neighbors. Thus the embedding learned from SACN is worse than that from Conv-TransE. One solution to this problem would be neighbor selection as in [Ying et al.2018].

Conclusion and Future Work

We have introduced an end-to-end structure-aware convolutional network (SACN). The encoding network is a weighted graph convolutional network (WGCN), utilizing knowledge graph connectivity structure, node attributes and relation types. WGCN with learnable weights has the benefit of adapting the amount of information collected from neighboring graph nodes. In addition, the entity attributes are added as nodes in the network so that attributes are transformed into knowledge structure information, which is easily integrated into the node embedding. The scoring network of SACN is a convolutional neural model, called Conv-TransE. It uses a convolutional network to model the relationship as a translation operation and to capture the translational characteristic between entities and relations. We also show that Conv-TransE alone already achieves state-of-the-art performance. Overall, SACN achieves about 10% relative improvement over the state of the art such as ConvE.

In the future, we would like to incorporate the neighbor selection idea into our training framework, such as the importance pooling in [Ying et al.2018], which takes into account the importance of neighbors when aggregating their vector representations. We would also like to extend our model to scale to larger knowledge graphs, encouraged by the results in [Ying et al.2018].

Acknowledgements

This work was partially supported by NSF grants CCF-1514357 and IIS-1718738, as well as NIH grants R01DA037349 and K02DA043063 to Jinbo Bi.

References