Neural Recurrent Structure Search for Knowledge Graph Embedding

11/17/2019 ∙ by Yongqi Zhang, et al.

Knowledge graph (KG) embedding is a fundamental problem in mining relational patterns. It aims to encode the entities and relations in a KG into a low-dimensional vector space that can be used by subsequent algorithms. Many KG embedding models have been proposed to learn the interactions between entities and relations, which contain meaningful semantic information. However, structural information, which encodes the local topology among entities, is also important in KGs. In this work, we propose S2E, which distills structural information and combines it with semantic information for different KGs, formulated as a neural architecture search (NAS) problem. First, we analyze the difficulty of using a unified model to solve the distillation problem. Based on this analysis, we define the path distiller, which recurrently combines structural and semantic information along relational paths sampled to preserve both local topologies and semantics. Then, inspired by the recent success of NAS, we design a recurrent-network-based search space for specific KG tasks and propose a natural gradient (NG) based search algorithm to update architectures. Experimental results demonstrate that the models searched by S2E outperform human-designed ones, and that the NG based search algorithm is efficient compared with other NAS methods. Besides, ours is the first NAS method for RNNs that searches architectures with better performance than human-designed models.







1 Introduction

[Table 1: Recurrent function of existing KG embedding models, grouped into semantic and structural. "∘" denotes the element-wise product, "⊗" the Hermitian product [34] in complex space, and σ a non-linear activation function. For example, PTransE recurrently adds relation embeddings, while S2E (NAS) uses a recurrent network searched by natural gradient descent.]

Knowledge Graph (KG) [3, 30, 36], as a special kind of graph with many relational facts, is important to data mining and machine learning. In KGs, each relational fact is represented as a triplet of the form (subject entity, relation, object entity), abbreviated as (s, r, o). KGs have gradually been employed in many knowledge-driven tasks, e.g., structured search [13], question answering [25] and recommendation [40]. However, the discrete representation of triplets cannot support these applications well [36]. KG embedding methods, which map the discrete entities and relations into continuous vector spaces, have recently emerged as a promising direction serving this purpose [7, 16, 31, 36]. The vector representations, i.e., embeddings, preserve the information in the original triplets and can be used together with other machine learning techniques [25, 40].

Semantic information, which models the interactions between entities and relations, is the main focus in KG embedding. The most fundamental issue when learning KG embeddings is how to preserve the information in the relational triplets (s, r, o). TransE [7], a representative embedding model, interprets a triplet as a translation from subject to object, i.e., the embeddings satisfy s + r ≈ o. Subsequent works, such as TransD [19], ComplEx [34], ConvE [12], etc., also learn the embeddings from single triplets. PTransE [23] extends TransE with multi-hop relation translation. By using manually selected paths, PTransE can learn compositional relation patterns, but fails to explore the structural information. These models mainly focus on the semantic information.

As KGs are graph-structured data, structural information, which encodes the local topology among entities, has gradually attracted attention. GCN-Align [37] learns cross-lingual alignment by using a graph convolutional network to leverage the structural information. However, leaving semantic information aside prevents this model from fully capturing the properties of KGs. More recently, [16] proposes the recurrent skipping network (RSN) to process relational paths with RNNs. By sampling paths along the KG and incorporating both entity and relation embeddings, RSN is a promising model for exploiting structural information along with semantic information. However, it focuses more on learning the structural information along paths and does not adapt well to tasks where semantic information is more important.

Given a KG task or dataset, the compositional relation patterns it has and the structure it forms are diverse [36]. How to exploit the structural information and combine it with semantic information for a specific KG task is non-trivial. Inspired by the success of neural architecture search (NAS) [14], we propose S2E in this paper, which can automatically combine Structural and Semantic information for KG Embedding. Similar to [16], we generate paths to extract both structural and semantic information in KGs. Based on the paths, we define the path distiller, i.e., a recurrent network, to distill the structural information in the paths and combine it with the semantic information. By reviewing many human-designed KG embedding models, we set up the search space for the recurrent network, which is quite different from the general NAS-for-RNN problem. To search efficiently, we propose a natural gradient based search algorithm for the network architecture. Extensive experiments on entity alignment and link prediction tasks show the effectiveness of the designed search space and the efficiency of the search algorithm. Our searched recurrent networks are novel and adapt well to different KG tasks by more effectively combining the structural and semantic information. The contributions of our work are summarized as follows:


  • We analyze the difficulty and importance of distilling structural information and combining it with semantic information for KGs. Based on the analysis, we formulate the above problem as an NAS problem that aims at searching recurrent networks specific to the KG domain.

  • We propose a domain-specific search space for the recurrent network. Different from searching RNN cells, the recurrent network in our space is specifically designed for KG tasks and covers many human-designed embedding models. This also sheds light on the design of future embedding models.

  • We identify the problems of adopting one-shot architecture search in our search space. These problems are rarely discussed in the NAS literature, and they motivate us to design a natural gradient based search algorithm that gives pseudo gradients for architecture search, which is much more efficient than existing derivative-free NAS methods.

  • We conduct experiments on entity alignment and link prediction tasks. Empirical results demonstrate that the models searched by S2E outperform human-designed ones, and that the search algorithm is more efficient than other NAS baselines.

2 Related Works

Before presenting the related work, we first give the notation used in this paper. We denote vectors by lowercase boldface and matrices by uppercase boldface. A Knowledge Graph is defined by a set of entities E, relations R and triplets S. A single triplet (s, r, o) represents a relation r that links from the subject entity s to the object entity o. Embeddings are denoted by the boldface letters of their indexes, e.g., s, r, o are the embeddings of s, r, o respectively.

(a) TransE.
(b) PTransE.
(c) RSN.
Figure 1: Graphical illustration of TransE, PTransE and RSN. TransE is a single-triplet model and is thus non-recurrent. PTransE recurrently processes the path without entity embeddings. RSN uses a full RNN and a skip connection. Each edge is an identity mapping unless it is explicitly marked as a weighted mapping with a weight matrix W_i, i = 1, …, 6 for the six edges. The activation functions of OP1 and OP2 are shown in their upper-right corners.

2.1 Knowledge Graph Embedding

Knowledge graph (KG) embedding aims to embed entities and relations into a low-dimensional vector space while preserving relevant properties of the original KG. Given a single triplet (s, r, o), TransE [7] and its successors TransD [19], ComplEx [34], ConvE [12], etc., interpret the interactions among the embeddings s, r and o in different ways. PTransE [23] manually selects some composite relations with semantic reliability in the KG and models the sequence of relations instead of single triplets. However, these works focus on learning semantic information, i.e., the interactions of entities with single or compositional relations.

Definition 1 (Relational Path [16]).

A relational path of length L is a sequence of triplets (e_1, r_1, e_2), (e_2, r_2, e_3), …, (e_L, r_L, e_{L+1}), where the object entity of the t-th triplet is the subject entity of the (t+1)-th for all t.
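As a minimal illustration of Definition 1, relational paths can be sampled by random walks over the triplet set. The sketch below uses a uniform walk over a toy KG; the entity and relation names are made up, and the biased walks actually used in the experiments (Appendix C.2) would reweight the `random.choice` step:

```python
import random

# Toy KG as (subject, relation, object) triplets -- illustrative only.
triplets = [
    ("a", "r1", "b"), ("b", "r2", "c"), ("c", "r3", "d"), ("b", "r4", "d"),
]

# Index outgoing edges per entity for walking.
out_edges = {}
for s, r, o in triplets:
    out_edges.setdefault(s, []).append((r, o))

def sample_path(start, length):
    """Sample a relational path of `length` triplets by a uniform random walk.

    Returns the alternating sequence [e_1, r_1, e_2, ..., r_L, e_{L+1}],
    or None if the walk reaches an entity without outgoing edges.
    """
    path, cur = [start], start
    for _ in range(length):
        if cur not in out_edges:
            return None
        r, o = random.choice(out_edges[cur])  # a biased walk would weight this
        path.extend([r, o])
        cur = o
    return path

path = sample_path("a", 2)  # e.g. ["a", "r1", "b", "r2", "c"]
```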

Structural information, i.e., the topology among entities, has gradually attracted attention in learning KG embeddings. GCN-Align [37] incorporates a graph convolutional network (GCN) [21] to aggregate the neighbors of each entity and capture structural information. However, ignoring the semantic information in the triplets leads to inferior performance. IPTransE [41] iteratively aligns entities in two KGs based on PTransE to implicitly leverage the internal structural information. More recently, the recurrent skipping network (RSN) [16] was proposed to exploit long-term relational dependencies in KGs. The idea is to model relational paths (Definition 1), which contain both structural and semantic information, instead of single triplets. By processing paths with a basic RNN and a skip connection, RSN incorporates the structural information well, but is still limited in learning semantic information. The expressions of several human-designed models are given in Tab. 1, and a graphical illustration of TransE, PTransE and RSN is shown in Fig. 1. They have distinct connection patterns and composition operators, leading to different ways of dealing with structural and semantic information.

2.2 Neural Architecture Search (NAS)

NAS is a promising way to design reasonable neural network architectures for different tasks. It has two important perspectives [42, 14, 18]: (i) search space: it defines which architectures can be represented in principle, like CNNs or RNNs. The search space should be carefully designed for specific scenarios and cover human wisdom as special cases. (ii) search algorithm: universal optimization tools are generally inefficient at finding good architectures. To make optimization fast, algorithms should be designed to exploit properties of the search space and domain information of the NAS problem.

The search space for CNNs has developed from searching basic convolution units for the whole architecture [42] to searching repeated cells within a fixed macro architecture based on human-designed networks [24]. Many searched architectures outperform human-designed CNNs in the literature. However, designing the search space for RNNs has attracted little attention, and the searched architectures do not outperform human-designed ones [24].

Recently, one-shot architecture search methods, e.g., DARTS [24], ASNG [1] and SNAS [39], have become the most popular NAS methods that can efficiently find good architectures. These methods directly construct a supernet, which contains all possible architectures spanned by the selected operations, and jointly optimize the network weights and architecture parameters by stochastic gradient descent. However, their success relies on two conditions. First, network weights can be shared, which means that directly training the supernet performs reasonably [39] and that removing a good operation deteriorates the supernet's performance more than removing a bad one [5]. Second, the validation measurement needs to be differentiable; otherwise, even with a differentiable relaxation of the architectures, we cannot get gradient information to update them [1].

3 Search Problem

As discussed in Sec. 2.1, structural information and semantic information are both important in KGs, and both can be captured by relational paths (Definition 1). However, for a specific KG dataset or task, the interactions between entities and relations and the degree distribution of entities are complex [29, 26, 36]. Designing a model that adapts well to different tasks and datasets is non-trivial. Single-triplet models, e.g., TransE and ComplEx, process each triplet in a path individually, so no structural information is utilized. PTransE aggregates relations rather than entities, and thus focuses mainly on semantics. ChainR and RSN recurrently process the entities and relations along paths. However, they emphasize the structural information too much and fail to learn the semantics well. Therefore, we are motivated to search for model architectures that adaptively distill structural information and combine it with semantic information.

3.1 Search Objective

The key problem now becomes how to leverage the relational path (Definition 1) to distill the structural information and combine it with semantic information. Based on existing embedding models, and inspired by the success of RSN [16] as well as the fact that a path is a sequence of triplets, much like a sentence is a sequence of words [32], we model the paths with a recurrent function in Definition 2.

Definition 2 (Path Distiller).

A path distiller recurrently processes the embeddings along a relational path (e_1, r_1, e_2), …, (e_L, r_L, e_{L+1}). In each recurrent step t, the distiller combines the embeddings of e_t and r_t with a distillation h_{t-1} of the preceding information to get an output o_t. The distiller is formulated as a recurrent function

(o_t, h_t) = f(e_t, r_t, h_{t-1}),   (1)

where the h_t's are the hidden states of the recurrent steps and h_0 = 0. The output o_t should approach the embedding of the object entity e_{t+1}.

Specifically, at each step t, we focus on one triplet (e_t, r_t, e_{t+1}) to preserve the semantic information. The subject entity embedding e_t and the relation embedding r_t are the inputs to the model. We use the hidden state h_t to combine structural and semantic information along the path. Besides, we collect an output o_t at each step and use it to predict the object entity of the current step. Therefore, how to form the recurrent function f for a given KG task is the key issue. As it is difficult to design an f that adapts well to various KG tasks and datasets, and motivated by the recent success of NAS [14] in searching neural network architectures, we propose to search for the recurrent function f as an RNN. The network here is not a general RNN but one specific to KG embedding tasks. Therefore, searching f is modeled as an NAS problem in Definition 3. We use α to denote the architecture of the recurrent network f.
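For intuition, the sketch below instantiates the recurrent function f with addition as every combinator (an assumption of ours for illustration, not a searched architecture). With h_0 = 0 this accumulates the translated path, and the first step reduces to TransE's s + r:

```python
import numpy as np

d = 8  # embedding dimension
rng = np.random.default_rng(0)

def distill_path(entity_embs, relation_embs):
    """Hypothetical all-addition path distiller: at step t, combine the
    subject embedding with the previous hidden state (OP1 = add), then
    with the relation (OP2 = add), and emit the result as both the
    output o_t and the next hidden state h_t."""
    h = np.zeros(d)
    outputs = []
    for e_t, r_t in zip(entity_embs, relation_embs):
        u = e_t + h          # OP1: combine entity with distilled history
        o_t = u + r_t        # OP2/OP3: combine with relation, emit output
        outputs.append(o_t)  # o_t is trained to approach the object entity
        h = o_t              # carry structural information along the path
    return outputs

es = [rng.standard_normal(d) for _ in range(3)]
rs = [rng.standard_normal(d) for _ in range(3)]
outs = distill_path(es, rs)
```

The searched architectures differ precisely in these choices: which vector feeds OP1, which combinators and activations OP1-OP3 use, and whether the hidden state is carried at all.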

Definition 3 (NAS Problem).

Let the training set be S_tra and the validation set be S_val. F(α, S_tra) returns the embeddings trained on S_tra with f, of which the architecture is α. M measures the performance of the embeddings on S_val. The problem is to find an architecture α for the path distiller such that the validation performance is maximized, i.e.,

α* = argmax_{α ∈ A} M(F(α, S_tra), S_val),   (2)

where A is the search space of α (i.e., it contains all possible architectures of f).

The training data S_tra is a set of paths. The validation and testing data S_val and S_tst are task-dependent. In the next part, we introduce how to design a proper search space, which should both cover human wisdom and allow efficient search.

3.2 Search Space

The human-designed KG embedding models in Tab. 1 take various forms. To deal with the structural information, the network should be able to express non-recurrent architectures covering TransE and ComplEx, architectures without entity embeddings like PTransE, and fully recurrent architectures like ChainR and RSN. To allow different ways of processing semantic information, we need to determine how to use different combinators, such as addition, multiplication and the Hermitian product, to make different transformations in the embedding space. Therefore, choosing an appropriate space for the distillers is not trivial. Based on the various models in Tab. 1, we design the search space in Fig. 2.

Figure 2: General search architecture of the recurrent network f. OP1-OP3 are three combinators; OP1 and OP2 are followed by an activation function. Blue and yellow dashed lines are the one-out-of-four selectors. The left-hand side of the figure focuses more on structural information, while the right side is more about semantics.

  • Connections: In the left part, we control connections to handle structural information. First, a single operator OP1 combines the entity embedding e_t with the distilled recurrent information h_{t-1}. As discussed in Sec. 3.1, we should support different connection types to distill the structural information: for non-recurrent structures like TransE, the hidden state may not be involved in the architecture, and OP1 can be disconnected to leave the entity embedding alone, as in PTransE. Therefore, we select one out of four candidate vectors, i.e., the entity embedding e_t, the hidden state h_{t-1}, the output of OP1, or a zero vector (no connection), to combine with the relation embedding r_t in OP2, and again to get the output in OP3.

  • Operations (Table 2): Once the connections are fixed, the operators OP1, OP2 and OP3 choose among different combination functions: addition, multiplication or the Hermitian product. To further enhance the distiller's learning ability, we add the gated operator GRU [10] as an additional combinator for OP1 and OP2. After the combination functions, OP1 and OP2 choose an activation function from identity, tanh, sigmoid and relu to enable non-linear transformations.

     | operations                                   | activation functions
OP1  | adding, multiplying, Hermitian product, GRU  | identity, tanh, sigmoid, relu
OP2  | adding, multiplying, Hermitian product, GRU  | identity, tanh, sigmoid, relu
OP3  | adding, multiplying, Hermitian product       | identity
Table 2: Candidates of operations and activation functions for OP1-3.

Each edge in Fig. 2 is either a trainable square matrix or an identity matrix. To summarize, there are three or four candidate operators for OP1-OP3, four activation functions for OP1-OP2, and four candidate connections each for OP2 and OP3, resulting in tens of thousands of possible models. More details of the search space are given in Appendix B.

3.3 Comparison with NAS for RNN

When forming the recurrent network f, a natural option is to search an RNN cell using standard NAS methods [42]. Even though both the distiller here and RNN cells aim to learn to control the information flow between recurrent steps, there are quite a few differences between them (Tab. 3). First, the search space for RNNs focuses more on different operators with fixed connections, whereas the search space in Fig. 2 focuses on combining structural and semantic information in KGs. Our search space can be explicitly categorized to fit different connections, such as non-recurrent architectures, architectures without entity embeddings and fully recurrent architectures. These domain-specific considerations make our space better regularized, and it generalizes many human-designed models. Besides, directly searched RNN cells have not shown an advantage over human-designed models [42, 24], since the search lacks domain understanding of specific tasks; in contrast, S2E consistently finds better architectures for various KG tasks.

perspective  | NAS for RNN                       | S2E
space        | general                           | KG-specific
algorithm    | one-shot NAS                      | natural gradient
performance  | worse than human-designed models  | consistently better
Table 3: Comparison of state-of-the-art NAS methods for general RNNs with the proposed S2E.

4 Search Algorithm

In the previous section, we introduced the search space A, which contains tens of thousands of different models. How to search efficiently in A is therefore an important problem, and designing an appropriate optimization algorithm for the discrete architecture parameters is a big challenge.

4.1 Failure of One-Shot Architecture Search

One-shot architecture search has been widely used for NAS [24, 39, 1] by jointly optimizing the architecture parameters in a supernet, i.e., a network containing all possible architectures spanned by the selected operations. Even though searching the recurrent function f can be regarded as a special kind of NAS, i.e., an RNN search problem, one-shot methods are not applicable here for the following two reasons:


  • (P1) the validation performance measurement M is not differentiable w.r.t. the architecture parameters;

  • (P2) sharing embedding parameters across architectures leads to problematic gradients.

For the first point, training and evaluation are quite different for KG embedding models. During training, the objective is to maximize the scores of positive triplets and minimize those of negative ones, while during validation, ranking metrics [7], which are discrete and non-differentiable, are generally used to measure performance. Thus, algorithms that rely on differentiable relaxations of operations, like DARTS [24] and SNAS [39], are not applicable here. For the second point, sharing embedding parameters is problematic: different from typical NAS applications, the training samples are themselves the learned embeddings, and an architecture's performance highly depends on the state of the embeddings (see Sec. 5.5).

4.2 Searching Architectures by Natural Gradient

As one-shot architecture search methods cannot be applied, we may turn to derivative-free optimization methods [11], such as reinforcement learning [42, 4] and Bayes optimization [6]. However, they are generally too slow [11]. In what follows, we propose a search algorithm based on natural gradient (NG) [2, 27], which searches more efficiently by generating pseudo gradients.

Let p_θ(α) be the distribution of architectures α, parameterized by θ. To enable the usage of NG, we first use stochastic relaxation [39, 1] to transform (2) into a maximization problem over θ, i.e.,

J(θ) = E_{α ~ p_θ(α)} [ M(F(α, S_tra), S_val) ],   (3)

where the performance is averaged over the distribution of α. Then, NG updates θ by θ_{t+1} = θ_t + ρ F_θ^{-1} ∇_θ J(θ_t), where ρ is the step-size, F_θ is the Fisher matrix at θ_t, and the gradient is approximated by

∇_θ J(θ) ≈ (1/m) Σ_{i=1}^{m} M(F(α_i, S_tra), S_val) ∇_θ ln p_θ(α_i),   (4)

where each architecture α_i is sampled i.i.d. from p_θ(α), and m is the number of samples used to approximate the expectation. The key observation that lets NG avoid problem P1 is that we never need to take the gradient of the non-differentiable M w.r.t. α in (4); we only need the validation performance, directly measured by M. In this way, (4) offers pseudo gradients of α in the space parameterized by θ. Then, for problem P2, the embeddings are not shared across different α's; instead, each F(α_i, S_tra) in (4) is obtained by training the model separately.

Besides, NG itself has several nice properties. As discussed in [2, 27], NG descent is parameterization-invariant, has good generalization ability, and can be regarded as a second-order method in the space of parameters θ. Details of the NG descent used in our work are given in Appendix A. These facts, together with NG's ability to overcome problems P1 and P2, make NG an ideal choice here. In Sec. 5.4, we also show that the proposed NG based search method is much more efficient than other derivative-free optimization methods.
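To make the update concrete, the sketch below runs the stochastic-relaxation loop of (3)-(4) on a single categorical architecture choice, with a made-up scorer standing in for M(F(α, S_tra), S_val). It applies the log-derivative trick of (4) directly on softmax logits; a faithful implementation would additionally precondition with the inverse Fisher matrix as in Appendix A:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                      # candidate choices for one architecture component
theta = np.zeros(K)        # logits parameterizing p_theta over choices

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def validation_score(choice):
    # Stand-in for M(F(alpha, S_tra), S_val): train embeddings with
    # architecture `choice`, then measure the (non-differentiable) MRR.
    return [0.2, 0.5, 0.9, 0.4][choice]

def ng_step(theta, m=16, rho=0.5):
    """One pseudo-gradient step: estimate grad J(theta) from sampled
    architectures and their validation scores only -- no gradient of M
    itself is needed. Fisher preconditioning is omitted for brevity."""
    p = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(m):
        a = rng.choice(K, p=p)
        score = validation_score(a)
        # grad of log p_theta(a) w.r.t. softmax logits: one_hot(a) - p
        grad += score * (np.eye(K)[a] - p)
    return theta + rho * grad / m

for _ in range(50):
    theta = ng_step(theta)
best = int(np.argmax(softmax(theta)))  # concentrates on the best-scoring choice
```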

Remark 4.1.

NG has also recently been adopted for searching CNNs [1]. However, that method is still a one-shot architecture search method, which needs parameter sharing. Thus, it still fails here due to problem P2, as empirically demonstrated in Sec. 5.5. Besides, it does not consider searching RNNs.

                     |     DBP-WD     |     DBP-YG     |     EN-FR      |     EN-DE
                     | H@1  H@10 MRR  | H@1  H@10 MRR  | H@1  H@10 MRR  | H@1  H@10 MRR
semantic   TransE    | 28.4 51.4 0.36 | 27.0 57.4 0.37 | 16.2 39.0 0.24 | 40.3 60.9 0.47
           TransD*   | 27.7 57.2 0.37 | 17.3 41.6 0.26 | 21.1 47.9 0.30 | 24.4 50.0 0.33
           PTransE   | 16.7 40.2 0.25 |  7.4 14.7 0.10 |  7.3 19.7 0.12 | 27.0 51.8 0.35
structural BootEA*   | 32.3 63.1 0.42 | 31.3 62.5 0.42 | 31.3 62.9 0.42 | 44.2 70.1 0.53
           IPTransE* | 23.1 51.7 0.33 | 22.7 50.0 0.32 | 25.5 55.7 0.36 | 31.3 59.2 0.41
           ChainR    | 32.2 60.0 0.42 | 35.3 64.0 0.45 | 31.4 60.1 0.41 | 41.3 68.9 0.51
           RSN*      | 38.8 65.7 0.49 | 40.0 67.5 0.50 | 34.7 63.1 0.44 | 48.7 72.0 0.57
           NAS-RNN   | 38.3 65.1 0.47 | 37.4 67.1 0.47 | 33.4 62.8 0.43 | 46.0 72.4 0.54
S2E (proposed)       | 41.1 71.0 0.51 | 40.6 73.4 0.52 | 36.1 67.5 0.46 | 50.2 75.7 0.59
Table 4: Performance comparison on the entity alignment datasets (DBP-WD, DBP-YG, EN-FR, EN-DE). H@k is short for Hit@k. The results of TransD [19], BootEA [31], IPTransE [41] and RSN [16] (marked *) are copied from [16].

5 Experiments

5.1 Experiment setup

Following [15, 16], we sample relational paths by biased random walks (details in Appendix C.2). The paths make up the training set and are not resampled for each dataset when evaluating different models.

We use two basic KG tasks, entity alignment and link prediction. Following the literature [7, 41, 16], we use ranking-based metrics: mean reciprocal rank (MRR) and Hit@k. MRR is the average of the reciprocal ranks, i.e., MRR = (1/|S|) Σ_{i∈S} 1/rank_i, where S is the validation or testing set and rank_i is the rank of the i-th ground truth. Hit@k is the percentage of ground truths appearing in the top k, namely Hit@k = (1/|S|) Σ_{i∈S} I(rank_i ≤ k), where I is the indicator function. To avoid underestimating the different models, we report performance in the "filtered" setting, i.e., all correct pairs or triplets that exist in the training, validation or testing set are filtered out of the rankings [7]. All models are compared with the same embedding dimension.
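Both metrics are straightforward to compute once the (filtered) ranks are known; a small self-contained sketch with hypothetical ranks:

```python
def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over the evaluation set."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hit_at_k(ranks, k):
    """Fraction of cases whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical filtered ranks of the correct entity for five test triplets.
ranks = [1, 2, 1, 10, 4]
scores = {"MRR": mrr(ranks), "H@1": hit_at_k(ranks, 1), "H@10": hit_at_k(ranks, 10)}
# MRR = (1 + 0.5 + 1 + 0.1 + 0.25) / 5 = 0.57, H@1 = 0.4, H@10 = 1.0
```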

Experiments are written in Python with the PyTorch framework [28] and run on a TITAN Xp GPU. Training details of each task are given in Appendix C.3.

5.2 Comparison with Human-designed models

Entity alignment

In this task, two KGs K1 and K2 are given, together with a set of aligned entity pairs, which is split into training/validation/testing sets. During training, the paths can walk through the aligned entity pairs in the training set so that K1 and K2 interact. For evaluation, we compute the cosine similarity of each candidate entity pair from the validation or testing set. We use the four cross-lingual and cross-database subsets of DBpedia and Wikidata generated by [16]: DBP-WD, DBP-YG, EN-FR and EN-DE. Details are given in Appendix C.1.

Figure 3: Four patterns with different connection components within the recurrent search space. From left to right: P1 to P4.

In this task, structural information is more important since aligned entities usually have similar local topologies. The testing performance of the models searched by S2E and of human-designed ones is compared in Tab. 4. We roughly categorize the human-designed models into "semantic" and "structural" based on their main focus. We observe that models leveraging structural information generally perform better than those focusing only on semantics. BootEA and IPTransE win over TransE and PTransE respectively by iteratively aligning discovered entity pairs to implicitly leverage the structural information. The performance of PTransE is poor since it emphasizes learning compositional relations too much and ignores the entities' topology along paths. ChainR, RSN and NAS-RNN outperform BootEA and IPTransE by explicitly processing both entities and relations along paths. NAS-RNN wins over ChainR by adapting the RNN architecture but is inferior to RSN due to the lack of domain-specific constraints. The models searched by S2E are given in Appendix C.4.

Link prediction

To further demonstrate the effectiveness of S2E, we run the link prediction task, which is the standard testbed for single-triplet models like TransE, ComplEx and SimplE. In this task, an incomplete KG is given, and the target is to predict the missing entity of an incomplete triplet. The evaluation is conducted as follows: for each triplet (s, r, o) in the validation or testing set, we compute the score of (s, r, o′) for every entity o′ and get the rank of the true object o; the rank of the subject s is computed similarly from the scores of (s′, r, o) over all entities s′. We use two famous benchmark datasets, WN18-RR [12] and FB15k-237 [33] (statistics in Appendix C.1), which are more realistic than their supersets WN18 and FB15k [7].
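The filtered ranking protocol described above can be sketched as follows with a TransE-style scorer (the embeddings and the `filtered_object_rank` helper are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, d = 50, 16
E = rng.standard_normal((n_entities, d))   # entity embeddings
r = rng.standard_normal(d)                 # one relation embedding

def filtered_object_rank(s, o, known_objects, score_fn):
    """Rank the true object o among all candidates for (s, r, ?),
    filtering out the other objects known to be correct from the
    train/valid/test sets. `score_fn(s_idx, o_idx)` is any triplet scorer."""
    scores = np.array([score_fn(s, cand) for cand in range(n_entities)])
    true_score = scores[o]
    # Mask known-correct candidates (except o itself) so they do not
    # penalize the rank -- the "filtered" protocol of [7].
    mask = np.ones(n_entities, dtype=bool)
    for cand in known_objects:
        if cand != o:
            mask[cand] = False
    return 1 + int((scores[mask] > true_score).sum())

# TransE-style score: higher is better (negative translation distance).
transe = lambda s, o: -np.linalg.norm(E[s] + r - E[o])
rank = filtered_object_rank(s=0, o=3, known_objects={3, 7}, score_fn=transe)
```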

models   |    WN18-RR      |    FB15k-237
         | H@1  H@10  MRR  | H@1  H@10  MRR
TransE   | 12.5 44.5 0.18  | 17.3 37.9 0.24
ComplEx  | 41.4 49.0 0.44  | 21.7 47.5 0.30
PTransE  | 27.2 46.4 0.34  | 20.3 45.1 0.29
RSN      | 38.0 44.8 0.40  | 19.2 41.8 0.27
S2E      | 42.1 52.4 0.45  | 21.9 48.1 0.30
Table 5: Performance comparison on link prediction tasks.

Due to the high computation cost of this task, we compare the models with a small dimension size of 64. As shown in Tab. 5, PTransE outperforms TransE by modeling compositional relations, but is worse than ComplEx because the adding operation is inferior to the Hermitian product when modeling the interaction between entities and relations [34]. RSN is worse than ComplEx since it pays more attention to structures than semantics. By adaptively searching the architecture, S2E finds models that perform better than the human-designed ones.

5.3 Case Study

In this part, we use the synthetic dataset Countries [8], which contains 271 countries and regions and 2 relations, neighbor and locatedin. The dataset contains three tasks: S1 and S2 require inferring 2-hop relation compositions; S3 is harder and requires modeling 3-hop relation compositions.

In order to show the difficulty of designing a proper recurrent function for given tasks, we extract four patterns within the search space in Fig. 2 and individually search in each pattern to fit S1-S3. Specifically, as shown in Fig. 3,


  • P1 represents the single hop embedding models;

  • P2 processes 2 successive steps;

  • P3 models long relational path without entity embedding;

  • P4 includes both entities and relations along path.

For each task in S1-S3, we randomly generate 100 models for each pattern and record the model with the best area under the precision-recall curve (AUC-PR) [9] on the validation set. The procedure is repeated 5 times for each pattern, and the testing performance is given in Tab. 6. We can see that no single pattern performs well on all tasks. For the easy tasks S1 and S2, semantic information is more important: incorporating entity embeddings along paths, as in P4, hinders the ability to directly model 2-hop relationships. The two-step pattern P2 consistently performs well on S1 and S2 since it fits these two tasks best. For the harder task S3, P3 and P4 win over the others since they can model longer steps. P4 slightly outperforms P3 by incorporating structural information along the path.

     |     S1      |     S2      |     S3
P1   | 0.998±0.001 | 0.997±0.002 | 0.918±0.031
P2   | 1.000±0.000 | 0.999±0.001 | 0.923±0.023
P3   | 0.992±0.001 | 1.000±0.000 | 0.944±0.016
P4   | 0.977±0.028 | 0.984±0.010 | 0.949±0.015
S2E  | 1.000±0.000 | 1.000±0.000 | 0.968±0.007
Table 6: Performance (AUC-PR) on the Countries datasets.

We evaluate the best model searched for each task. As shown in the last line of Tab. 6, the model searched by S2E achieves good performance even on the hard task S3. Besides, S2E prefers different candidates (see Appendix C.4) for S1-S3 over the same search space. This verifies our analysis that distilling structural information and combining it with semantic information is difficult, and that no unified model adapts well to all KG tasks and datasets.

5.4 Comparison with NAS Methods

We compare the proposed algorithm with other search algorithms here. We use random search (Rand) as the simplest baseline. Reinforcement learning (RL) has a similar objective to (3), but only uses the first-order gradient. Besides, we use the tree-structured Parzen estimator (TPE) [6] as a representative Bayes optimization method that is widely used in architecture search. Finally, our natural gradient based search algorithm is denoted S2E.

We show the top-1 and top-5 mean MRR values w.r.t. the number of sampled models in Fig. 4 to demonstrate the effectiveness of natural gradient. The entity alignment task is evaluated on DBP-WD, and link prediction uses WN18-RR. All sampled models are trained to convergence, and the MRR value of each model is the converged value on the validation set. As shown, all heuristic algorithms, i.e., S2E, RL and TPE, outperform random search. Among them, S2E converges faster and performs better than the other search algorithms, as indicated by the blue lines. The reason is that using the REINFORCE gradient [38] in RL leads to suboptimal solutions when p_θ is not well parameterized, and TPE easily falls into local optima without warm-start training [6]. In comparison, natural gradient is parameterization-invariant and can be regarded as a second-order method, and thus performs the best.

(a) Entity alignment.
(b) Link prediction.
Figure 4: Efficiency comparison of S2E with various NAS methods. Each model is evaluated with the same number of epochs. DBP-WD is used for entity alignment and WN18-RR for link prediction.

5.5 Problem on Embedding Sharing

In this part, we empirically show the inherent problems of training a supernet with shared embeddings and model parameters. When searching CNN architectures under the one-shot framework, parameters can be shared in the supernet so that the model can focus on the operations that are most useful for generating good predictions [5]. However, while the input training set is fixed for all operations in a CNN, the KG embeddings keep changing. As shown in Fig. 5(a), the distributions of entity embeddings trained by two different KG embedding models are rather distinct. Simultaneously updating the embeddings with different operations leads to distorted performance, as shown in Fig. 5(b). Besides, it is not reliable to compare different operators when the embedding distributions differ. Combining the discussion in Sec. 4.1 with the empirical analysis in this part, we conclude that one-shot architecture search is not applicable to our search problem.

(a) Visualization by T-SNE.
(b) Validation performance.
Figure 5: Left: t-SNE [35] visualization of embeddings from RSN and PTransE. Right: validation performance of the embedding-shared model versus training the single model RSN.

6 Conclusion

In this paper, we propose an NAS method, S2E, to search RNNs for learning from relational paths in KGs, which contain both structural and semantic information. By designing a specific search space based on human-designed KG embedding models, S2E can adaptively distill structural information and combine it with semantic information for different KG tasks. Since the one-shot architecture framework is not applicable to this problem, we propose an NG-based search algorithm that is efficient compared with the other baselines.


  • [1] Y. Akimoto, S. Shirakawa, N. Yoshinari, K. Uchida, S. Saito, and K. Nishida (2019) Adaptive stochastic natural gradient method for one-shot neural architecture search. In ICML, pp. 171–180. Cited by: §2.2, §4.1, §4.2, Remark 4.1.
  • [2] S. Amari (1998) Natural gradient works efficiently in learning. Neural Computation 10 (2), pp. 251–276. Cited by: §4.2, §4.2.
  • [3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives (2007) DBpedia: a nucleus for a web of open data. In The semantic web, pp. 722–735. Cited by: §1.
  • [4] B. Baker, O. Gupta, N. Naik, and R. Raskar (2017) Designing neural network architectures using reinforcement learning. In ICLR, Cited by: §4.2.
  • [5] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018) Understanding and simplifying one-shot architecture search. In ICML, pp. 549–558. Cited by: §2.2, §5.5.
  • [6] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In NeurIPS, pp. 2546–2554. Cited by: §4.2, §5.4, §5.4.
  • [7] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In NeurIPS, pp. 2787–2795. Cited by: §1, §1, §2.1, §4.1, §5.1, §5.2.
  • [8] G. Bouchard, S. Singh, and T. Trouillon (2015) On approximate reasoning capabilities of low-rank vector spaces. In AAAI Spring Symposium Series, Cited by: §5.3.
  • [9] K. Boyd, K. H. Eng, and C. Page (2013) Area under the precision-recall curve: point estimates and confidence intervals. In Joint ECML-KDD, pp. 451–466. Cited by: §5.3.
  • [10] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2015) Gated feedback recurrent neural networks. In ICML, pp. 2067–2075. Cited by: §B.2, 2nd item.
  • [11] A. R. Conn, K. Scheinberg, and L. N. Vicente (2009) Introduction to derivative-free optimization. Vol. 8, Siam. Cited by: §4.2.
  • [12] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel (2017) Convolutional 2D knowledge graph embeddings. In AAAI, Cited by: §1, §2.1, §5.2.
  • [13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang (2014) Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In SIGKDD, pp. 601–610. Cited by: §1.
  • [14] T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: a survey.. JMLR 20 (55), pp. 1–21. Cited by: §2.2, §3.1.
  • [15] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In SIGKDD, pp. 855–864. Cited by: §C.2, §5.1.
  • [16] L. Guo, Z. Sun, and W. Hu (2019) Learning to exploit long-term relational dependencies in knowledge graphs. In ICML, Cited by: §C.1, §C.2, §C.3, §1, §1, §1, §2.1, §3.1, §5.1, §5.1, §5.2, Definition 1.
  • [17] D. Guthrie, B. Allison, W. Liu, L. Guthrie, and Y. Wilks (2006) A closer look at skip-gram modelling.. In LREC, pp. 1222–1225. Cited by: §C.2.
  • [18] F. Hutter, L. Kotthoff, and J. Vanschoren (Eds.) (2018) Automated machine learning: methods, systems, challenges. Springer. Note: In press. Cited by: §2.2.
  • [19] G. Ji, S. He, L. Xu, K. Liu, and J. Zhao (2015) Knowledge graph embedding via dynamic mapping matrix. In ACL, Vol. 1, pp. 687–696. Cited by: §1, §2.1, Table 4.
  • [20] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980. Cited by: §C.2.
  • [21] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv:1609.02907. Cited by: §2.1.
  • [22] T. Lacroix, N. Usunier, and G. Obozinski (2018) Canonical tensor decomposition for knowledge base completion. In ICML, Cited by: §C.2.
  • [23] Y. Lin, Z. Liu, H. Luan, M. Sun, S. Rao, and S. Liu (2015) Modeling relation paths for representation learning of knowledge bases. Technical report arXiv:1506.00379. Cited by: §1, §2.1.
  • [24] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In ICLR, Cited by: §2.2, §2.2, §3.3, §4.1, §4.1.
  • [25] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer (2017) Neural network-based question answering over knowledge graphs on word and character level. In WWW, pp. 1211–1220. Cited by: §1.
  • [26] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich (2016) A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1), pp. 11–33. Cited by: §3.
  • [27] R. Pascanu and Y. Bengio (2013) Revisiting natural gradient for deep networks. Technical report arXiv preprint arXiv:1301.3584. Cited by: §4.2, §4.2.
  • [28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In ICLR, Cited by: §5.1.
  • [29] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In SIGKDD, pp. 701–710. Cited by: §3.
  • [30] R. Socher, D. Chen, C. Manning, and A. Ng (2013) Reasoning with neural tensor networks for knowledge base completion. In NeurIPS, pp. 926–934. Cited by: §1.
  • [31] Z. Sun, W. Hu, Q. Zhang, and Y. Qu (2018) Bootstrapping entity alignment with knowledge graph embedding.. In IJCAI, pp. 4396–4402. Cited by: §1, Table 4.
  • [32] M. Sundermeyer, R. Schlüter, and H. Ney (2012) LSTM neural networks for language modeling. In CISCA, Cited by: §3.1.
  • [33] K. Toutanova and D. Chen (2015) Observed versus latent features for knowledge base and text inference. In Workshop on CVSMC, pp. 57–66. Cited by: §5.2.
  • [34] T. Trouillon, C. Dance, E. Gaussier, J. Welbl, S. Riedel, and G. Bouchard (2017) Knowledge graph completion via complex tensor factorization. JMLR 18 (1), pp. 4735–4772. Cited by: §B.1, §B.1, Table 1, §1, §2.1, §5.2, Proposition 1.
  • [35] L. van der Maaten and G. E. Hinton (2008) Visualizing data using t-SNE. JMLR. Cited by: Figure 5.
  • [36] Q. Wang, Z. Mao, B. Wang, and L. Guo (2017) Knowledge graph embedding: a survey of approaches and applications. TKDE 29 (12), pp. 2724–2743. Cited by: §1, §1, §3.
  • [37] Z. Wang, Q. Lv, X. Lan, and Y. Zhang (2018) Cross-lingual knowledge graph alignment via graph convolutional networks. In EMNLP, pp. 349–357. Cited by: §1, §2.1.
  • [38] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. MLJ 8 (3-4), pp. 229–256. Cited by: §5.4.
  • [39] S. Xie, H. Zheng, C. Liu, and L. Lin (2018) SNAS: stochastic neural architecture search. In ICLR, Cited by: §2.2, §4.1, §4.1, §4.2.
  • [40] F. Zhang, N. J. Yuan, D. Lian, X. Xie, and W.-Y. Ma (2016) Collaborative knowledge base embedding for recommender systems. In SIGKDD, pp. 353–362. Cited by: §1.
  • [41] H. Zhu, R. Xie, Z. Liu, and M. Sun (2017) Iterative entity alignment via joint knowledge embeddings.. In IJCAI, pp. 4258–4264. Cited by: §2.1, Table 4, §5.1.
  • [42] B. Zoph and Q. Le (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §2.2, §2.2, §3.3, §4.2.

Appendix A NG details

To obtain the optimization direction of the architectures $\alpha$, we use the natural gradient to generate a pseudo gradient. The key point here is to use stochastic relaxation to give differentiable feedback. Specifically, we introduce a parametric family $\{p_\theta : \theta \in \Theta\}$ of probability distributions defined on the architecture space $\mathcal{A}$. Assume that for all $\theta \in \Theta$, the distribution admits a density function $p_\theta(\alpha)$ w.r.t. the reference measure on $\mathcal{A}$, that the density can approach a Dirac-delta distribution, and that it is differentiable w.r.t. $\theta$. Then we define the stochastic relaxation $J(\theta)$ of the objective $f(\alpha)$ as

$$J(\theta) = \mathbb{E}_{\alpha \sim p_\theta}\left[f(\alpha)\right],$$

leading to the maximization problem in (3).

Since the limit of $p_\theta$ converges to a Dirac-delta distribution concentrated on a single architecture, maximization of $J(\theta)$ leads to the maximization of $f(\alpha)$. The gradient is computed as

$$\nabla_\theta J(\theta) = \int f(\alpha)\,\nabla_\theta p_\theta(\alpha)\,d\alpha = \mathbb{E}_{\alpha \sim p_\theta}\left[f(\alpha)\,\nabla_\theta \log p_\theta(\alpha)\right],$$

where the well-known log-trick is used in the deduction.

Then the NG algorithm updates $\theta$ by

$$\theta \leftarrow \theta + \rho\, F(\theta)^{-1}\,\nabla_\theta J(\theta),$$

where $\rho$ is the step-size, and $F(\theta)$ is the Fisher information matrix at $\theta$. Computing $F(\theta)^{-1}$ is usually time-consuming. To reduce the computation cost, the exponential family $p_\theta(\alpha) = h(\alpha)\exp\left(\eta(\theta)^\top T(\alpha) - A(\theta)\right)$, where $h$, $T$ and $A$ are known functions depending on the target distribution, is used to parameterize $p_\theta$. For simplicity, we set $h(\alpha) = 1$ and choose the expectation parameters $\theta = \mathbb{E}\left[T(\alpha)\right]$. Under this parameterization, the natural gradient of the log-likelihood reduces to $F(\theta)^{-1}\nabla_\theta \log p_\theta(\alpha) = T(\alpha) - \theta$, so the inverse Fisher information matrix never has to be formed explicitly. In this work, since the architecture parameters are all categorical, for a component with $K$ categories we use a $(K-1)$-dimensional vector $\theta$ to model the distribution: the probability of sampling the $k$-th category ($k < K$) is $\theta_k$, and that of the $K$-th category is $1 - \sum_{k=1}^{K-1}\theta_k$. Note that the natural gradient is only used to update the first $K-1$ dimensions of $\theta$, and $T(\alpha)$ is the one-hot vector (without dimension $K$) of the sampled category.
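The categorical NG update above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's released code; `scores` stands in for the validation feedback (e.g. MRR) of the sampled architectures.

```python
import numpy as np

def sample_category(theta, rng):
    """Sample a category from the (K-1)-dimensional expectation parameters."""
    probs = np.append(theta, 1.0 - theta.sum())  # recover the K-th probability
    return rng.choice(probs.size, p=probs)

def ng_step(theta, categories, scores, step=0.1):
    """One natural-gradient step: theta <- theta + step * mean_i f_i (T(a_i) - theta).

    Under the expectation parameterization, the natural gradient of the
    log-likelihood is simply T(a) - theta, so no Fisher matrix is inverted.
    T(a) is the one-hot vector of the sampled category without its last entry.
    """
    grad = np.zeros_like(theta)
    for cat, f in zip(categories, scores):
        t = np.zeros_like(theta)
        if cat < theta.size:  # the K-th category has no explicit dimension
            t[cat] = 1.0
        grad += f * (t - theta)
    return theta + step * grad / len(categories)
```

In practice the update is projected back into the probability simplex; that bookkeeping is omitted here for clarity.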

Appendix B Search Space Details

B.1 Hermitian product

The Hermitian product is defined on the complex space, the same as in [34]. Denote two complex vectors $u = u^1 + i\,u^2$ and $v = v^1 + i\,v^2$, where superscripts 1 and 2 index the real and imaginary parts respectively. The Hermitian product is

$$\langle u, v\rangle = \bar{u}\odot v = \left(u^1\odot v^1 + u^2\odot v^2\right) + i\left(u^1\odot v^2 - u^2\odot v^1\right),$$

where $\bar{u}$ is the complex conjugate of $u$. Based on the above equation, we can define the Hermitian product in the real space. Given two real-valued vectors $x, y \in \mathbb{R}^d$ of the same dimension $d$, we evenly split both into two parts, i.e. $x = [x^1; x^2]$ and $y = [y^1; y^2]$, where $x^1, x^2, y^1, y^2$ all have dimension $d/2$. Then

$$x \otimes y = \left[x^1\odot y^1 + x^2\odot y^2;\;\; x^1\odot y^2 - x^2\odot y^1\right]. \qquad (6)$$

Thus, we get a $d$-dimensional composition vector $x \otimes y$.

Proposition 1.

Given a triplet $(s, r, o)$ and the embeddings $\mathbf{s}, \mathbf{r}, \mathbf{o} \in \mathbb{R}^d$, let $\mathbf{v} = \mathbf{s} \otimes \mathbf{r}$; then $\langle \mathbf{v}, \mathbf{o}\rangle$ is the same as the ComplEx [34] scoring function.

Based on (6),

$$\langle \mathbf{s}\otimes\mathbf{r}, \mathbf{o}\rangle = \sum \left(\mathbf{s}^1\odot\mathbf{r}^1 + \mathbf{s}^2\odot\mathbf{r}^2\right)\odot\mathbf{o}^1 + \left(\mathbf{s}^1\odot\mathbf{r}^2 - \mathbf{s}^2\odot\mathbf{r}^1\right)\odot\mathbf{o}^2, \qquad (7)$$

where the summation works along the embedding dimension. By replacing superscript "1" with the real part of a complex value and "2" with the imaginary part, (7) resembles equation (7) in [34]. ∎
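As a sanity check, the real-space formula (6) can be compared numerically against a direct complex-number computation. A minimal NumPy sketch, assuming the conjugate-of-the-first-argument convention used above:

```python
import numpy as np

def hermitian_product(x, y):
    """Real-space Hermitian product of two d-dim vectors (d must be even).

    Splits x = [x1; x2] and y = [y1; y2], treating the halves as real and
    imaginary parts, and returns conj(x) * y written back as a real vector.
    """
    d = x.size // 2
    x1, x2 = x[:d], x[d:]
    y1, y2 = y[:d], y[d:]
    real = x1 * y1 + x2 * y2
    imag = x1 * y2 - x2 * y1
    return np.concatenate([real, imag])
```

The output is the $d$-dimensional composition vector of (6), whose first half is the real part and second half the imaginary part of the complex Hermitian product.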

B.2 Space size reduction

As discussed, the search space contains a large number of possible architectures. However, not all of them need to be trained and evaluated: we can filter out the combinations that are not promising.

Proposition 2.

When the zero vector $\mathbf{0}$ is connected to OP2 and the combinator in OP2 is multiply or the Hermitian product, the output of OP2 will be $\mathbf{0}$.

It is easy to see that multiplying with $\mathbf{0}$ leads to a zero output. Based on the introduction of the Hermitian product in B.1, OP2 will also output $\mathbf{0}$ if its inputs are combined with the Hermitian product. In this case, the recurrent information will be removed and nothing will be propagated recurrently. The same holds for OP3, since multiplying with $\mathbf{0}$ gives $\mathbf{0}$.

Besides, the GRU [10] unit already contains weighted edges as well as activation functions. As a function specially designed for recurrent structures, GRU is only allowed in OP2 when OP1 or the previous recurrent state is connected; when a non-recurrent input is connected to OP2, a recurrent architecture will not exist.

Based on the above analysis, we manually avoid evaluating the architectures that are not promising (as in Prop. 2) or that do not make sense (a non-recurrent connection of GRU).
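The two filtering rules can be encoded as a simple predicate that prunes candidates before training. The input and operator names below are hypothetical placeholders, not the paper's actual candidate identifiers:

```python
# Hypothetical encoding of the pruning rules in Prop. 2 and the GRU constraint.
ZERO = "zero"                       # the zero vector as an input choice
RECURRENT = {"op1", "h_prev"}       # inputs that keep the recurrence alive (assumed names)

def is_promising(op2_input, op2_combinator):
    """Return False for OP2 configurations ruled out in Appendix B.2."""
    # Multiply / Hermitian product with the zero vector always outputs 0,
    # so nothing would propagate recurrently.
    if op2_input == ZERO and op2_combinator in {"multiply", "hermitian"}:
        return False
    # GRU is only meaningful on a recurrent connection.
    if op2_combinator == "gru" and op2_input not in RECURRENT:
        return False
    return True
```

Applying such a predicate over the enumerated candidate set shrinks the number of architectures that actually need evaluation.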

Appendix C Experiment details

C.1 Datasets

The datasets we use for the entity alignment task are subsets of DBpedia and Wikidata, forming cross-lingual or cross-KG pairs. Detailed statistics and information are given in Tab. 7. They are the same as the Normal version, which is very sparse, in [16].

The data for link prediction are WN18-RR and FB15k-237, which are subsets of WN18 and FB15k respectively. Compared with their supersets, the two selected sets are more realistic.

data source # relations # triplets
DBP-WD DBpedia-English 253 38,421
Wikidata-English 144 40,159
DBP-YG DBpedia-English 219 33,571
YAGO3-English 30 34,660
EN-FR DBpedia-English 211 36,508
DBpedia-French 177 33,532
EN-DE DBpedia-English 225 38,281
DBpedia-German 118 37,069
Table 7: Statistics of the datasets we used for entity alignment. Each single KG contains 15,000 entities. There are 4,500 aligned entity pairs in training sets, and 11,500 pairs for evaluation. The first 10% pairs among the 11,500 are used for validation and the left for testing. We use “#” short for “number of”.
Dataset #entity #rels #train #valid #test
WN18-RR 40,943 11 86,835 3,034 3,134
FB15k-237 14,541 237 272,115 17,535 20,466
Table 8: Statistics of the datasets we used for link prediction. “rels” is short for “relations”.

C.2 Path sampling

Following [15, 16], we sample paths using biased random walks. In a conventional random walk, all neighbors have the same probability of being sampled as the next step. In a biased random walk, we sample with larger probability the neighbors that go deeper or cross into the other KG. For a single KG, we use one parameter to give the depth bias: with this probability the next step goes to neighbors that are two steps away from the previous entity, and with the remaining probability to other neighbors. Similarly, for two KGs we use another parameter to give the cross-KG bias, encouraging the next step to jump to the other KG through the aligned entity pairs in the training set. Since aligned pairs are usually rare in the training set, encouraging cross-KG walks helps learn the aligned entities in the two KGs better.

Since the KGs are usually large, we sample two paths for each triplet in the training set. The length of paths is 7 for the entity alignment task on two KGs and 3 for the link prediction task on a single KG. The paths are fixed for each dataset to keep the comparison among different models fair. The parameters used for sampling paths are summarized in Tab. 9.

                  depth bias  cross-KG bias  length
entity alignment  0.9         0.9            7
link prediction   0.7         —              3
Table 9: Parameters used for sampling paths.
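The depth-biased walk on a single KG can be sketched as follows. This is an illustrative reconstruction under stated assumptions (adjacency stored as a dict of `(relation, entity)` edges, the depth-bias parameter named `alpha` here), not the paper's released sampler:

```python
import random

def biased_walk(neighbors, start, num_triplets, alpha=0.7, rng=random):
    """Depth-biased random walk on a single KG.

    neighbors: dict mapping entity -> list of (relation, entity) edges.
    With probability alpha the walk prefers neighbors that move away from
    the previous entity (i.e. are not in its 1-hop neighborhood), which
    corresponds to going two steps deeper; otherwise any neighbor is taken.
    Returns an alternating entity/relation path: [e0, r1, e1, r2, e2, ...].
    """
    path = [start]
    cur, prev = start, None
    for _ in range(num_triplets):
        cand = neighbors.get(cur, [])
        if not cand:
            break
        if prev is not None and rng.random() < alpha:
            prev_nbrs = {e for _, e in neighbors.get(prev, [])} | {prev}
            deep = [(r, e) for r, e in cand if e not in prev_nbrs]
            if deep:
                cand = deep
        r, nxt = rng.choice(cand)
        path += [r, nxt]
        prev, cur = cur, nxt
    return path
```

The cross-KG bias for entity alignment would add an analogous preference for edges crossing the aligned entity pairs; it is omitted here for brevity.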

Once the paths are sampled, we start training based on them. The loss for each path is given in (8):

$$\mathcal{L} = -\sum_{t}\log\frac{\exp\left(\mathbf{h}_t^\top \mathbf{o}_t\right)}{\sum_{e\in\mathcal{E}}\exp\left(\mathbf{h}_t^\top \mathbf{e}\right)}. \qquad (8)$$

In each recurrent step $t$, we focus on one triplet $(s_t, r_t, o_t)$. The subject embedding $\mathbf{s}_t$ and relation embedding $\mathbf{r}_t$ are processed along with the distilled information to get the output $\mathbf{h}_t$, which is encouraged to approach the object embedding $\mathbf{o}_t$. Thus, the objective in (8) is a multi-class log-loss with the object as the true label. Rather than using the skip-gram model [17] based on negative samples, the multi-class loss introduced by [22] is more stable. Training is based on mini-batch gradient descent, and Adam [20] is used to update the model parameters.
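Under the setup above, the per-path multi-class log-loss can be sketched in NumPy. Function and array names here are ours for illustration, not the paper's:

```python
import numpy as np

def path_loss(outputs, object_ids, entity_emb):
    """Multi-class log-loss over a path: at each step t, the network output
    h_t should score the true object o_t highest among all entities.

    outputs: (T, d) array of step outputs h_t.
    object_ids: (T,) integer ids of the true objects.
    entity_emb: (num_entities, d) entity embedding matrix.
    """
    logits = outputs @ entity_emb.T                         # (T, num_entities)
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(object_ids)), object_ids].mean()
```

Computing the softmax over all entities is what makes this a multi-class loss rather than a negative-sampling one; for large entity sets the denominator is the dominant cost.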

C.3 Hyperparameters and training details

We use RSN [16] as a standard baseline to tune hyper-parameters. We use the HyperOpt package, which is based on the Tree Parzen Estimator, to search the learning rate, L2 penalty, decay rate, batch size, and a dropout rate applied to the input embeddings. The tuning ranges are given in Tab. 10.

hyper-param ranges
Table 10: Searching range of hyper-parameters

The embedding dimension is 256 for the entity alignment task and 64 for link prediction. After the hyper-parameter tuning, the proposed S2E starts searching with this hyper-parameter setting. For all tasks and datasets, we search and evaluate 100 models. Among them, we select the best model indicated by MRR performance on the validation set and fine-tune the hyper-parameters for this model.

As for the search time, tuning hyper-parameters takes about half a GPU day and searching models on the entity alignment datasets takes two GPU days. Link prediction takes longer since the evaluation process is more expensive: one GPU day for hyper-parameters, two GPU days for WN18-RR and three GPU days for FB15k-237. Note that the cost of searching a better model is of the same order of magnitude as tuning hyper-parameters, which indicates that our search space is reasonable and the search algorithm is efficient.

C.4 Searched models

Entity alignment

The models searched by S2E, which consistently perform best on each dataset, are graphically illustrated in Fig. 6. As shown, all of the searched recurrent networks process the recurrent information, entity embeddings and relation embeddings together, but they have different connections, composition operators and activation functions. The functions learned on DBP-WD and EN-FR jointly model two successive triplets to learn a small neighborhood, while the other two process longer steps.

(a) DBP-WD
(b) DBP-YG
(c) EN-FR
(d) EN-DE
Figure 6: Graphical representation of the searched recurrent network on each dataset in the entity alignment task.

Link prediction

The best models searched in the link prediction task are given in Fig. 7. In Tab. 11, we report statistics of the distance, i.e. the shortest-path distance when regarding the KG as a simple undirected graph, between the two entities of each triplet in the validation and testing sets. As shown, most triplets have a distance of at most 3. Besides, as indicated by the performance on the two datasets, we infer that triplets more than 3 hops apart are very challenging to model; at least in this task, triplets within 3 hops are the main focus of different models. This also explains why RSN, which processes long paths, does not perform well in the link prediction task. The searched models in Fig. 7 do not consider long-term structures: the architecture on WN18-RR models two successive triplets, and the architecture on FB15k-237 only models a single triplet. They focus more on semantics and rely on gradient descent for the interaction among triplets.

(a) WN18-RR
(b) FB15k-237
Figure 7: Graphical representation of the searched models on each dataset in the link prediction task.
Datasets               Hops: 1   2      3      >3
WN18-RR    validation  35.5%  8.8%   22.2%  33.5%
           testing     35.0%  9.3%   21.4%  34.3%
FB15k-237  validation  0%     73.2%  26.1%  0.7%
           testing     0%     73.4%  26.8%  0.8%
Table 11: Percentage of the k-hop triplets in the validation and testing datasets.
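Hop statistics of this kind can be computed with a bounded breadth-first search over the KG viewed as a simple undirected graph. A small sketch (helper names are ours):

```python
from collections import deque

def hop_distance(edges, src, dst, max_hops=3):
    """Shortest-path distance between src and dst, treating the KG as a
    simple undirected graph; returns max_hops + 1 when dst is farther than
    max_hops (the ">max_hops" bucket) or unreachable.
    """
    adj = {}
    for h, t in edges:
        adj.setdefault(h, set()).add(t)
        adj.setdefault(t, set()).add(h)
    if src == dst:
        return 0
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == max_hops:           # do not expand beyond the bound
            continue
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return max_hops + 1
```

Bucketing the returned distances for every validation/testing triplet yields a table in the shape of Tab. 11.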


We also give a graphical illustration of the architectures searched on the Countries dataset in Fig. 8. The search procedure is conducted over the whole search space rather than restricted to the four patterns P1-P4. As shown in Fig. 8, (a) belongs to P1, (b) belongs to P2 and (c) belongs to P4. These results further verify that S2E can adaptively search architectures for specific KG tasks and datasets.

(a) S1
(b) S2
(c) S3
Figure 8: Graphical representation of the searched models on the Countries dataset.