AutoKGE: Searching Scoring Functions for Knowledge Graph Embedding

04/26/2019 ∙ by Yongqi Zhang, et al. ∙ The Hong Kong University of Science and Technology

Knowledge graph embedding (KGE) aims to find low-dimensional vector representations of entities and relations so that their similarities can be quantified. Scoring functions (SFs), which are used to build a model measuring the similarity between entities based on a given relation, have developed into the crux of KGE. Humans have designed many SFs in the literature, and the evolution of SFs has been the primary driving force behind improvements in KGE's performance. However, such improvements have gradually become marginal. Besides, with so many SFs available, making a proper choice among them has already become a non-trivial problem. Inspired by the recent success of automated machine learning (AutoML), in this paper we propose automated KGE (AutoKGE), which designs and discovers distinct SFs for KGE automatically. We first identify a unified representation over popularly used SFs, which helps to set up a search space for AutoKGE. Then, we propose a greedy algorithm, enhanced by a predictor that estimates the final performance without model training, to search through the space. Extensive experiments on benchmark datasets demonstrate the effectiveness and efficiency of our AutoKGE. Finally, the SFs searched by our method are KG-dependent, new to the literature, and outperform existing state-of-the-art SFs designed by humans.







1. Introduction

Knowledge Graph (KG), as a special kind of graph structure with entities as nodes and relations as directed edges, has gained a lot of interest recently. In KGs, each edge is represented as a triplet of the form (head entity, relation, tail entity), denoted as (h, r, t), indicating that two entities h (the head) and t (the tail) are connected by a relation r, e.g., (NewYork, isLocatedIn, USA) (Singhal, 2012). A number of large-scale KGs have been established in the last decades, such as WordNet (Miller, 1995), Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), and YAGO (Suchanek et al., 2007). They have improved various downstream applications, e.g., structured search (Singhal, 2012; Dong et al., 2014), question answering (Lukovnikov et al., 2017), and entity recommendation (Zhang et al., 2016).

In KGs, a fundamental issue is how to quantify the similarity of a given triplet so that subsequent applications can be performed (Getoor and Taskar, 2007; Wang et al., 2017b). Recently, knowledge graph embedding (KGE) has emerged and developed as a promising method serving this purpose (Nickel et al., 2011; Lao et al., 2011; Yang et al., 2015; Wang et al., 2014; Ji et al., 2015; Nickel et al., 2016b; Dettmers et al., 2017; Trouillon et al., 2017; Xue et al., 2018; Kazemi and Poole, 2018; Lacroix et al., 2018; Zhang et al., 2018). Basically, given a set of observed facts (triplets), KGE attempts to learn low-dimensional vector representations of entities and relations so that the similarities of triplets can be quantified. Specifically, such similarity is measured by a scoring function (SF), which returns a score for each triplet (h, r, t) based on the embeddings. Usually, the SF is designed and chosen by humans and is also the most important component of KGE, as it can significantly affect the embeddings' quality (Nickel et al., 2016a; Wang et al., 2017b; Lin et al., 2018).

To generate high-quality embeddings, SFs should enjoy both scalability and expressiveness (Trouillon et al., 2017; Kazemi and Poole, 2018; Wang et al., 2017a). Scalability means that the parameters of a KGE model based on an SF should grow no faster than linearly w.r.t. the number of entities and relations, while expressiveness requires that an SF be able to handle common relations in KGs, i.e., symmetric (Yang et al., 2015), anti-symmetric (Liu et al., 2017), general asymmetric (Trouillon et al., 2017) and inverse relations (Kazemi and Poole, 2018). Since the invention of KGE, many SFs have been proposed in the literature. Examples are TransE (Bordes et al., 2013) and its variants, such as TransH (Wang et al., 2014), TransR (Lin et al., 2015), and TransD (Ji et al., 2015), which belong to translational distance models (TDMs) and model the similarity using projected distance in vector space; and RESCAL (Nickel et al., 2011), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2017), Analogy (Liu et al., 2017), and the more recently proposed SimplE (Kazemi and Poole, 2018), which fall into bilinear models (BLMs) and express the similarity through bilinear functions. Besides, inspired by the success of neural networks in image classification and word representation (Bengio et al., 2013), neural network models (NNMs) have also been explored as SFs, e.g., MLP (Dong et al., 2014), NTN (Socher et al., 2013) and ConvE (Dettmers et al., 2017). Among existing SFs, BLM-based ones, e.g., ComplEx (Trouillon et al., 2017) and SimplE (Lacroix et al., 2018; Kazemi and Poole, 2018), are very powerful, as indicated by state-of-the-art results and theoretical analysis, and they enjoy both scalability and expressiveness (Wang et al., 2017a). However, the performance gains from newly human-designed SFs have gradually become marginal. This is mainly because different KGs have distinct patterns in relations (Paulheim, 2017), so that an SF that adapts well to one KG may not perform consistently on other KGs. Besides, with so many SFs, choosing the best one for a given KG has itself already become a non-trivial problem.

Recently, automated machine learning (AutoML) (Yao and Wang, 2018) has exhibited its power in many machine learning tasks, e.g., searching better neural network architectures for deep learning models (Zoph and Le, 2017; Baker et al., 2017) and configuring good ensembles of out-of-the-box classifiers (Feurer et al., 2015). In this paper, inspired by the success of these applications, we propose automated KGE (AutoKGE), which can automatically search an SF for a given KG. The proposed AutoKGE is the first AutoML work in the KG literature; it can simultaneously reduce humans' efforts in designing new SFs and construct distinct SFs that adapt to specific KGs. However, it is not easy to fulfill these goals. To achieve them, we make the following contributions:


  • First, we make important observations among existing SFs falling into BLMs, which help us represent BLM-based SFs in a unified form. Based on this unified representation, we set up the search space for AutoKGE. Such a space is not only specific enough to cover good SFs designed by humans, but also general enough to include many novel SFs not yet visited in the literature.

  • Second, we observe that it is common for different KGs to have distinct properties regarding relations that are symmetric, anti-symmetric, general asymmetric, etc. This inspires us to conduct domain-specific analysis of KGE models, and to design constraints on expressiveness that effectively guide subsequent searches in the space.

  • Third, we propose a progressive greedy algorithm to search through such a space. We further build a predictor with specifically designed symmetry-related features (SRFs) to avoid training SFs that are unlikely to perform well. Due to the designed SRFs, the predictor can accurately capture the expressiveness of candidate SFs, and is thus able to significantly cut down the number of model trainings.

  • Finally, we conduct experiments on five popular datasets: WN18, FB15k, their variants WN18RR and FB15k-237, and YAGO3-10. Empirical results demonstrate that SFs searched by AutoKGE can outperform state-of-the-art SFs designed by humans. Besides, the searched SFs are also KG-dependent and new to the literature.


For a KG, its entity and relation sets are given by E and R, respectively. A triplet in the KG is given by (h, r, t), where h and t are the indexes of the head and tail entities, and r is the index of the relation. Embedding parameters of a KGE model are given for each entity and each relation. For simplicity, the embeddings in this paper are represented by boldface letters of indexes, e.g., h and t are the embeddings of the entities indexed by h and t, and r is the embedding of the relation indexed by r. ⟨h, r, t⟩ is the triple dot product, which equals Σ_i h_i r_i t_i for real-valued vectors and is the Hermitian product for complex-valued vectors as in (Trouillon et al., 2017). The diagonal matrix diag(r) is constructed with the elements of r. Finally, f(h, r, t) is the scoring function (SF), which returns a real value reflecting the similarity of the triplet (h, r, t); a higher score indicates more similarity.

2. Related Works

2.1. The Framework of KGE

Given a set of observed (positive) triplets, the goal of KGE is to learn low-dimensional vector representations of entities and relations so that the similarities, measured by f, of observed triplets are maximized while those of non-observed ones are minimized (Wang et al., 2017b). To build a KGE model, the most important thing is to design and choose a proper SF f, which is exactly what is used to measure such similarities. Since different SFs have their own weaknesses and strengths in capturing similarity, the choice of f is critical to KGE's performance (Wang et al., 2017b; Lin et al., 2018). A large number of KGE models with popular SFs follow the same framework (Wang et al., 2017b), which is based on stochastic gradient descent (SGD) and shown in Algorithm 1. A few others use a multi-class loss w.r.t. all the entities (Dettmers et al., 2017; Lacroix et al., 2018), which requires much larger memory.

Algorithm 1 contains two main parts, i.e., negative sampling (step 5) and embedding updates (step 6). As there are only positive triplets in the training data, negative sampling is used to pick some negative triplets from all non-observed ones for the current positive triplet (h, r, t). This is usually done with fixed distributions (Wang et al., 2014) or dynamic sampling schemes (Zhang et al., 2018). After that, the gradients, computed based on the given SF and embeddings, are used to update the model parameters.

2.2. Scoring Functions (SFs)

0:  training set S, scoring function f;
1:  initialize the embedding parameters for each entity and relation.
2:  for i = 1, ..., number of epochs do
3:     sample a mini-batch S_b ⊆ S of size m;
4:     for each (h, r, t) ∈ S_b do
5:         sample negative triplets for the positive triplet (h, r, t);
6:         update parameters of embeddings w.r.t. the gradients using f;
7:     end for
8:  end for
9:  return  embeddings of all entities and relations;
Algorithm 1 General framework of KGE (Wang et al., 2017b).
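A minimal runnable sketch of the framework in Algorithm 1, assuming a DistMult-style SF, uniform negative sampling on the tail entity, and a logistic loss; all concrete choices here (function names, hyper-parameters such as `n_neg`) are ours rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_kge(triplets, n_ent, n_rel, dim=8, epochs=20, m=4, n_neg=2, lr=0.1):
    """Sketch of Algorithm 1 with SF f(h,r,t) = <h, r, t> (DistMult-style)."""
    E = rng.normal(scale=0.1, size=(n_ent, dim))   # entity embeddings
    R = rng.normal(scale=0.1, size=(n_rel, dim))   # relation embeddings
    for _ in range(epochs):
        # step 3: sample a mini-batch of size m
        batch = [triplets[i] for i in rng.choice(len(triplets), size=m)]
        for (h, r, t) in batch:
            # step 5: negative sampling by corrupting the tail entity
            negs = [(h, r, int(rng.integers(n_ent))) for _ in range(n_neg)]
            # step 6: SGD update on positive (y=+1) and negative (y=-1) triplets
            for (hh, rr, tt), y in [((h, r, t), 1.0)] + [(n, -1.0) for n in negs]:
                score = np.sum(E[hh] * R[rr] * E[tt])
                g = -y / (1.0 + np.exp(y * score))  # d(logistic loss)/d(score)
                gh, gr, gt = g * R[rr] * E[tt], g * E[hh] * E[tt], g * E[hh] * R[rr]
                E[hh] -= lr * gh; R[rr] -= lr * gr; E[tt] -= lr * gt
    return E, R
```

In the full method, the SF inside the inner loop is the searched candidate rather than this fixed DistMult form.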

Ever since the invention of KGE, the evolution of SFs has been the main driving force behind improvements in KGE's performance (Wang et al., 2017b; Lin et al., 2018). SFs should be designed with good expressive ability so that they can handle common relations in real applications. Besides, their parameters should grow linearly with the number of entities and relations in order to keep KGE scalable (Trouillon et al., 2017). Existing human-designed SFs mainly fall into three types:


  • Translational distance models (TDMs): The translational approach exploits distance-based SFs. Inspired by the word-analogy results in word embeddings (Bengio et al., 2013), the similarity is measured by the distance between the two entities after a translation carried out by the relation. In TransE (Bordes et al., 2013), the SF is defined by the (negative) distance between h + r and t, i.e., f(h, r, t) = −‖h + r − t‖. Other TDM-based SFs, e.g., TransH (Wang et al., 2014), TransR (Lin et al., 2015) and TransD (Ji et al., 2015), enhance TransE by introducing extra mapping matrices.

  • BiLinear models (BLMs): SFs in this group measure the similarity of a triplet with a product-based form. Generally, they share the form f(h, r, t) = h^T M_r t, where M_r is a matrix referring to the embedding of relation r (Wang et al., 2017b, a). RESCAL (Nickel et al., 2011) models the embedding of each relation directly with a full matrix M_r. DistMult (Yang et al., 2015) overcomes the overfitting problem of RESCAL by constraining M_r to be diagonal. ComplEx (Trouillon et al., 2017) allows the embeddings to be complex-valued, which enables handling asymmetric relations. HolE (Nickel et al., 2016b) uses circular correlation to replace the dot-product operation, but has been proven equivalent to ComplEx (Hayashi and Shimbo, 2017). Analogy (Liu et al., 2017) constrains M_r to be normal and commutative, and is implemented as a weighted combination of DistMult and ComplEx. Finally, SimplE (Kazemi and Poole, 2018) uses two groups of embeddings for each entity and relation to deal with inverse relations.

  • Neural network models (NNMs): Neural models output the probability of a triplet using neural networks that take the entities' and relations' embeddings as inputs. MLP proposed in (Dong et al., 2014) and NTN proposed in (Socher et al., 2013) are representative neural models. Both of them use a large number of parameters to combine entities' and relations' embeddings. ConvE (Dettmers et al., 2017) takes advantage of convolutional neural networks to increase the interaction among different dimensions of the embeddings.
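To make the BLM forms above concrete, here is a small numerical sketch of the three most common variants; the SimplE form is simplified (split halves, averaged forward and inverse terms) and all names are our own:

```python
import numpy as np

def distmult(h, r, t):
    # DistMult: <h, r, t> = sum_i h_i * r_i * t_i with real embeddings;
    # inherently symmetric in h and t
    return float(np.sum(h * r * t))

def complex_score(h, r, t):
    # ComplEx: Re(<h, r, conj(t)>) with complex embeddings;
    # the conjugate makes asymmetric relations representable
    return float(np.real(np.sum(h * r * np.conj(t))))

def simple_score(h, r, r_inv, t):
    # SimplE (sketched): each entity keeps two embedding halves; averaging
    # a forward and an inverse term lets the model handle inverse relations
    h1, h2 = np.split(h, 2)
    t1, t2 = np.split(t, 2)
    return 0.5 * (distmult(h1, r, t2) + distmult(t1, r_inv, h2))
```

Note how DistMult cannot distinguish (h, r, t) from (t, r, h), while ComplEx can, which is exactly the expressiveness distinction discussed above.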

Currently, BLMs are the best among the above three types of models (Kazemi and Poole, 2018; Wang et al., 2017a; Lacroix et al., 2018). TDMs, as proved in (Wang et al., 2017a), have less expressive ability than BLMs, which further leads to inferior empirical performance. Inspired by the success of deep networks, NNMs have also been developed for KGE. However, due to their huge model complexity and increased training difficulty, they are hard to train to good performance (Kazemi and Poole, 2018; Dettmers et al., 2017; Lacroix et al., 2018). Thus, we focus on BLMs in the sequel.

2.3. Automated Machine Learning (AutoML)

To ease the use of machine learning models and to design better ones, automated machine learning (AutoML) has become a hot topic with both academic interest and industrial need (Yao and Wang, 2018). Representative works are neural architecture search (NAS) (Zoph and Le, 2017; Baker et al., 2017) and auto-sklearn (Feurer et al., 2015). These works have achieved exciting breakthroughs recently. NAS has identified networks with fewer parameters but better performance than the best networks designed by humans. Besides, auto-sklearn can effectively and quickly find proper hyper-parameters, which previously required great human effort in fine-tuning. This paper is the first step towards automated embedding of knowledge graphs. However, this step is not trivial, as previous AutoML methods such as NAS and auto-sklearn cannot be directly applied to KGE. The main problem is that we need to explore domain-specific properties here in order to achieve good performance.

3. Problem Definition

As mentioned in Section 2, new designs of SFs have continuously boosted the performance of KGE in recent years. However, such improvements have gradually become marginal. Besides, as different KGs can exhibit distinct patterns in relations, how to choose a proper SF and tune it to good performance is also a non-trivial task. These raise one question: can we automatically design an SF for a given KG with a good performance guarantee?

Here, we observe important patterns among SFs falling into BLMs in Section 3.1 by viewing state-of-the-art SFs (Yang et al., 2015; Trouillon et al., 2017; Liu et al., 2017; Kazemi and Poole, 2018) in a unified representation. This subsequently helps us to set up the search space and constraints for searching SFs in Section 3.2.

scoring function | embeddings | definition
DistMult (Yang et al., 2015) | h, r, t ∈ R^d | ⟨h, r, t⟩
ComplEx (Trouillon et al., 2017) | h, r, t ∈ C^d | Re(⟨h, r, conj(t)⟩)
Analogy (Liu et al., 2017) | ĥ, r̂, t̂ real; ȟ, ř, ť complex | ⟨ĥ, r̂, t̂⟩ + Re(⟨ȟ, ř, conj(ť)⟩)
SimplE (Kazemi and Poole, 2018) | ĥ, r̂, t̂, ȟ, ř, ť real | ⟨ĥ, r̂, ť⟩ + ⟨ȟ, ř, t̂⟩
Table 1. Existing SFs covered by our unified representation. For Analogy and SimplE, the embedding splits into two parts, i.e., h = [ĥ; ȟ] (same for r and t). conj(t) denotes the complex conjugate of t.

3.1. The Unified Representation

To motivate a unified representation over existing SFs, let us first look at the four commonly used ones shown in Table 1, which belong to BLMs. We leave out RESCAL (Nickel et al., 2011) since it neither performs well empirically nor has good scalability (Liu et al., 2017; Trouillon et al., 2017). Recall that, as discussed in Section 2.2, state-of-the-art performance is achieved by BLMs (Lacroix et al., 2018; Kazemi and Poole, 2018); thus we limit our scope to them.

From Table 1, we can see that there are two main differences among these SFs. First, the embeddings can be either real or complex, e.g., DistMult vs. ComplEx. Second, when the embedding vectors are split, different SFs combine them in distinct manners, e.g., Analogy vs. SimplE. The first difference can be removed by using 2d-dimensional real vectors to replace the d-dimensional complex vectors (Trouillon et al., 2017). Let h = [h_re; h_im] with h_re, h_im ∈ R^d (same for r and t); then we can express ComplEx as

f(h, r, t) = ⟨h_re, r_re, t_re⟩ + ⟨h_im, r_re, t_im⟩ + ⟨h_re, r_im, t_im⟩ − ⟨h_im, r_im, t_re⟩.
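The real expansion of ComplEx can be checked numerically; the self-contained snippet below verifies, for random embeddings, that Re(⟨h, r, conj(t)⟩) equals the four-term real form (this is our own sanity check, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
# random complex embeddings h, r, t
h, r, t = (rng.normal(size=d) + 1j * rng.normal(size=d) for _ in range(3))

def tri(a, b, c):
    # triple dot product <a, b, c> = sum_i a_i * b_i * c_i
    return float(np.sum(a * b * c))

# complex form: Re(<h, r, conj(t)>)
complex_form = float(np.real(np.sum(h * r * np.conj(t))))

# real expansion with h = h_re + i*h_im, etc.
expanded = (tri(h.real, r.real, t.real) + tri(h.imag, r.real, t.imag)
            + tri(h.real, r.imag, t.imag) - tri(h.imag, r.imag, t.real))

assert np.isclose(complex_form, expanded)
```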

Based on such an expansion, to further deal with the second problem, we partition each embedding into four parts, i.e., h = [h_1; h_2; h_3; h_4] (same for r and t). Thus, each SF can be reformulated as a sum of signed multiplicative terms ⟨h_i, r_k, t_j⟩.

These combinations are equivalent to the bilinear function h^T M_r t with different forms of the relation matrix M_r. Let D_{r_i} = diag(r_i) for i = 1, 2, 3, 4. The forms of M_r for the SFs in Table 1 are shown in Figure 1, where blank space denotes the zero matrix.

(a) DistMult.
(b) ComplEx.
(c) Analogy.
(d) SimplE.
Figure 1. A graphical illustration of M_r for the existing SFs in Table 1 under the proposed unified representation.

Viewed in this form, we can see that the main difference among the M_r's of the four SFs lies in how the diagonal matrices are filled into a block matrix. In total, there are 16 blocks, and each is chosen from the zero matrix, D_{r_1}, ..., D_{r_4}, or their negative counterparts. Thus, while D_{r_1}, ..., D_{r_4} appear in all these SFs, they present different structures. Based on such a pattern, we identify a unified representation of SFs as follows:

Definition 3.1 (Unified Representation).

Let g(r) return a 4×4 block matrix, of which the block in position (i, j) is chosen from {0, ±D_{r_1}, ±D_{r_2}, ±D_{r_3}, ±D_{r_4}} for i, j = 1, ..., 4. Then, SFs can be represented as f(h, r, t) = h^T g(r) t.
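Under Definition 3.1, an SF is fully specified by a 4×4 integer structure matrix. A minimal sketch follows; the integer encoding, names, and example matrices are our own illustration:

```python
import numpy as np

def unified_score(h, r, t, A):
    """Sketch of the unified SF: h, r, t are each split into 4 equal parts,
    and A is a 4x4 integer matrix whose entry a_ij in {0, ±1, ..., ±4}
    selects the block sign(a_ij) * diag(r_{|a_ij|}) of g(r)."""
    hs, rs, ts = np.split(h, 4), np.split(r, 4), np.split(t, 4)
    score = 0.0
    for i in range(4):
        for j in range(4):
            a = A[i][j]
            if a != 0:
                score += np.sign(a) * np.sum(hs[i] * rs[abs(a) - 1] * ts[j])
    return float(score)

# DistMult fills the diagonal with D_{r_1}, ..., D_{r_4}
DISTMULT = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
# ComplEx under the real expansion, with parts (1,2) = real and (3,4) = imaginary
COMPLEX = [[1, 0, 3, 0], [0, 2, 0, 4], [-3, 0, 1, 0], [0, -4, 0, 2]]
```

Evaluating `unified_score` with `DISTMULT` reproduces the plain triple dot product, and with `COMPLEX` it reproduces the ComplEx score under the real/imaginary split.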


3.2. AutoKGE: Searching for SFs

The unified representation in Definition 3.1 makes up the search space of SFs, i.e., the possible choices of the block matrix g(r). Thus, designing a good SF is equivalent to finding a good point in such a search space. Since the SF is the crux of KGE, we define the problem of AutoKGE as follows:

Definition 3.2 (AutoKGE).

Let F(g) be a KGE model built on structure g (with indexed embeddings of entities and relations), and let M(F, S) measure the performance (the higher the better) of a KGE model on a set of triplets S. The problem of searching the SF is formulated as:

g* = argmax_{g ∈ G} M(F(g; S_tra), S_val),     (2)

where G contains all possible choices of g(r), and S_tra, S_val denote the train and validation sets.

As shown in (2), there are two levels of optimization. First, we need to train the KGE model on the training set S_tra, and then measure the performance of the current structure g on the validation set S_val. Besides, there are 9^16 possible structures for g(r), i.e., the size of G, since each of the 16 blocks is chosen from 9 possibilities; this is prohibitively large. All these make searching for a proper g(r) a difficult problem.

3.2.1. Expressiveness: constraints on the search.

Expressiveness (Trouillon et al., 2017; Kazemi and Poole, 2018; Wang et al., 2017a), which means that f must be able to handle common relations in KGs, is a big concern for SFs. Generally, four common relations are popularly discussed in the literature, i.e., symmetric (Yang et al., 2015), anti-symmetric (Trouillon et al., 2017; Nickel et al., 2016b), general asymmetric (Liu et al., 2017; Dettmers et al., 2017) and inverse relations (Kazemi and Poole, 2018). Their meanings and the consequent requirements on f and g(r) are summarized in Table 2. To ensure that f can handle those common relations, we propose Proposition 1.

Proposition 1.

If g(r) can be symmetric, i.e., g(r) = g(r)^T for some r, and skew-symmetric, i.e., g(r) = −g(r)^T for some r, then f is expressive.

With such a Proposition, we introduce the following constraint on g(r):

  • (C1) g(r) is able to be both symmetric (g(r) = g(r)^T) with proper r and skew-symmetric (g(r) = −g(r)^T) with proper r.

Besides, to avoid trivial solutions, we also need:

  • (C2) g(r) has no zero rows/columns;

  • (C3) g(r) covers all of r_1 to r_4; and

  • (C4) g(r) has no repeated rows/columns.

common relations | requirements on f | requirements on g(r) | examples from WN18/FB15k
symmetric | f(h, r, t) = f(t, r, h) | g(r) = g(r)^T | IsSimilarTo, Spouse
anti-symmetric | f(h, r, t) = −f(t, r, h) | g(r) = −g(r)^T | LargerThan, IsPartOf
general asymmetric | f(h, r, t) ≠ f(t, r, h) in general | no symmetry constraint on g(r) | LocatedIn, Profession
inverse | f(h, r_1, t) = f(t, r_2, h) for an inverse pair (r_1, r_2) | g(r_1) = g(r_2)^T | Hypernym, Hyponym
Table 2. Common relations in KGs and their resulting requirements on SFs and search candidates in G.

Basically, for (C2), if there are zero rows/columns in g(r), then the corresponding dimensions in h or t are useless and will never be optimized if the entity only appears as the head or the tail in the training set. Similarly, for (C3), if some r_i's are not contained in g(r), then the corresponding parts of r are not used. Finally, for (C4), if two rows/columns have the same content, then the two are equivalent, which reduces the effective number of parameters in the embeddings. Constraints (C1)-(C4) are important for finding good candidates of g(r), and they play a key role in cutting down the search space in order to design an efficient search algorithm.
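Constraints (C2)-(C4) are purely structural, so they can be checked directly on the integer structure matrix without knowing r. A sketch, using the same 4×4 integer encoding as before (names and encoding are ours):

```python
import numpy as np

def satisfies_c2_c4(A):
    """Check the structural Constraints (C2)-(C4) on a 4x4 integer matrix A
    whose entries lie in {0, ±1, ..., ±4}; a sketch of the search filter."""
    A = np.asarray(A)
    # (C2): no zero rows/columns
    if np.any(np.all(A == 0, axis=0)) or np.any(np.all(A == 0, axis=1)):
        return False
    # (C3): all of r_1..r_4 must appear somewhere in g(r)
    if set(np.abs(A[A != 0])) != {1, 2, 3, 4}:
        return False
    # (C4): no repeated rows/columns
    rows = {tuple(row) for row in A}
    cols = {tuple(col) for col in A.T}
    return len(rows) == 4 and len(cols) == 4
```

Constraint (C1), by contrast, depends on the values of r and cannot be checked this way; it is handled later via the SRF-based predictor.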

3.2.2. Discussion: domain-specific AutoML.

While the AutoKGE problem (Definition 3.2) follows the same principles as NAS (Xie and Yuille, 2017; Zoph and Le, 2017; Baker et al., 2017) and auto-sklearn (Feurer et al., 2015), i.e., both high-level configuration search and low-level model training are involved, the search space is fundamentally different. Specifically, in NAS the search space is made up of design choices of neural network architectures; in auto-sklearn the search space comprises hyper-parameters of out-of-the-box classifiers. The search space identified in AutoKGE, i.e., the choices of g(r) in Definition 3.2, is motivated by our unified representation over commonly used BLM-based SFs in Definition 3.1. Besides, we clarify the importance of ensuring expressiveness in the search space, which is rooted in KGE's applications. These are all specific to KGE's domain and new to the AutoML literature.

4. Search Strategy

In Section 3, we defined the AutoKGE problem, whose crux is searching for SFs. In this section, we propose an efficient search strategy to address it.

4.1. Challenges: Invariance and Expressiveness

Like other AutoML problems, e.g., NAS and auto-sklearn, the search problem of AutoKGE is black-box, the search space is huge, and each step in the search is expensive, as model training and evaluation are involved. These problems have previously been tackled by algorithms including reinforcement learning (Zoph and Le, 2017; Baker et al., 2017), Bayesian optimization (Feurer et al., 2015) and genetic programming (Xie and Yuille, 2017). However, they are not applicable here. The reason is that we have extra domain-specific problems in KGE, which are more challenging.

As shown in Figure 2, first, we need to deal with invariance properties in the search space: changing the order of the r_i's or flipping the sign of r_i to −r_i may result in the same learning performance. Second, while Constraints (C2)-(C4) are easy to handle, it is difficult to deal with Constraint (C1), as illustrated in Figure 3. The reason is that g(r) only specifies a structure, while the exact check of Constraint (C1) relies on the values in r, which are unknown in advance.

(a) example.
(b) permutation.
(c) flip sign.
Figure 2. Invariance on permutation and flipping signs.
(a) symmetric.
(b) anti-symmetric.
Figure 3. The ability to model symmetric relations and anti-symmetric relations under proper values of r.

In this work, we propose a progressive greedy search algorithm to handle Constraints (C1)-(C4). To avoid model training, the proposed algorithm is further enhanced with a predictor. Besides, as acquiring training data for the predictor is expensive, we carefully design features, extracted from the structure of g(r), which are effective and sample-efficient, overcome the invariance property in the search space, and further address Constraint (C1).

4.2. Progressive Greedy Search

In this section, we propose a progressive greedy algorithm to explore the search space of g(r) in Definition 3.2. Let the number of nonzero blocks in g(r) be b, and denote an SF with b nonzero multiplicative terms as f_b. The idea of progressive search is that, given the desired number of blocks B, we start from a small b and then gradually add in more blocks until b = B. Recall from Section 3.1 that adding one more block into g(r) corresponds to adding one more nonzero multiplicative term into f, i.e., s⟨h_i, r_k, t_j⟩ with s ∈ {±1} and i, j, k ∈ {1, ..., 4}. In order to deal with Constraint (C1), we increase b by 2 in each step so that newly added blocks do not trivially lie on the diagonal. Moreover, due to Constraints (C2)-(C4), we have to start from b = 4 to include all of r_1 to r_4. Therefore, when b > 4, candidate f_b's can be generated following Algorithm 2. Basically, an f_{b−2} is picked from the candidate set K_{b−2}, which greedily collects the top-ranking f_{b−2}'s; a new f_b is generated at step 3; its uniqueness and consistency with Constraints (C2)-(C4) are ensured at steps 4-6. Finally, the set H_b is returned when N new candidates have been found.

0:  the number of candidate models N;
1:  while the size of H_b is smaller than N do
2:     randomly select a model f_{b−2} from the candidate set K_{b−2};
3:     generate 6 integers i_1, k_1, j_1, i_2, k_2, j_2 from {1, ..., 4} and signs s_1, s_2 from {±1}, and form f_b = f_{b−2} + s_1⟨h_{i_1}, r_{k_1}, t_{j_1}⟩ + s_2⟨h_{i_2}, r_{k_2}, t_{j_2}⟩;
4:     if f_b ∉ H_b and f_b meets Constraints (C2)-(C4) then
5:         H_b ← H_b ∪ {f_b};
6:     end if
7:  end while
8:  return  H_b.
Algorithm 2 Candidates generation.
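The generation loop of Algorithm 2 can be sketched as follows. An SF is encoded here as a frozenset of signed multiplicative terms (s, i, k, j), meaning s·⟨h_i, r_k, t_j⟩; this encoding and all names are our own illustration, and the full (C2)-(C4) check is left out for brevity:

```python
import random

def generate_candidates(parents, n_candidates, seen, rng=random.Random(0)):
    """Sketch of Algorithm 2: grow selected SFs by two extra signed terms."""
    out = set()
    while len(out) < n_candidates:
        parent = rng.choice(parents)            # step 2: pick a model f_{b-2}
        new_terms = {(rng.choice([-1, 1]),      # step 3: two fresh signed terms
                      rng.randrange(4), rng.randrange(4), rng.randrange(4))
                     for _ in range(2)}
        child = frozenset(parent | new_terms)
        # steps 4-6: keep only genuinely larger, unseen candidates
        # (a full implementation would also enforce Constraints (C2)-(C4) here)
        if len(child) == len(parent) + 2 and child not in seen:
            out.add(child)
            seen.add(child)
    return out
```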

Based on Algorithm 2, Algorithm 3 shows our greedy algorithm. Since there are only a few candidates for f_4 under Constraints (C2)-(C4), we train and evaluate all of them at step 2 of Algorithm 3. When b > 4, possible candidates based on the previous candidate sets are generated at step 7, and are selected by the predictor P at step 8. The predictor helps avoid training poor SFs in H_b. As a result, only the top candidates according to P remain and are evaluated with model training. In this way, the predictor can significantly reduce the number of model trainings and speed up the whole search. Therefore, how fast and accurate the predictor can be is key to the success of Algorithm 3. Moreover, training data for P is gradually collected from the recorded SFs at steps 3 and 10, and P is incrementally updated at step 12 as b grows.

0:  B: the number of nonzero blocks in g(r);
1:  if B = 4 then
2:     generate, train and evaluate all f_4 models (24 in total);
3:      record all the f_4's and their scores;
4:      keep the f_4's with top scores in K_4;
5:  else
6:     for b in {6, 8, ..., B} do
7:         generate a set H_b of candidates using Algorithm 2;
8:         select top ones in H_b based on the predictor P;
9:         train and evaluate the selected models in parallel;
10:          record the f_b's and their scores;
11:          keep the f_b's with top scores in K_b;
12:         extract features of each f_b (see details in the appendix) and update the predictor P using stochastic gradient descent;
13:     end for
14:  end if
15:  return  desired SFs in K_B.
Algorithm 3 Progressive greedy search algorithm.

Under Constraints (C2)-(C4), the number of candidates that can be generated at step 3 of Algorithm 2 is substantially reduced compared with the raw space. Besides, the size of each candidate set is bounded, as indicated by steps 4 and 11 of Algorithm 3. Thus, the greedy strategy visits far fewer structures than a random search over all of G. However, searching among the remaining candidates is still far from efficient. Therefore, we introduce a predictor in the next section to further improve efficiency in selecting promising SFs, and design its features based on important properties observed in the search space.

4.3. Candidates Selection Using the Predictor

Basically, once an SF is given, its performance on a specific KG is determined under a fixed set of hyper-parameters and training procedures. In other words, the performance is closely related to how the SF is formed. Considering that training an SF through Algorithm 1 and collecting its performance takes time, we use a learning model to predict the performance and select promising SFs without the demanding model training in Algorithm 1. In this section, we show how to design the predictor effectively.

4.3.1. Design principles.

The idea of using a performance predictor is not new; it has recently been explored in algorithm selection (Eggensperger et al., 2015), network performance prediction (Liu et al., 2018) and hyper-parameter optimization (Feurer et al., 2015). In general, we need to extract features for the points visited by the search algorithm, and then build a learning model to predict validation performance based on those features. The following are fundamental principles a good predictor needs to meet:

  • (P1) Correlate well with true performance: the predictor need not accurately predict the exact values of validation performance; instead, it should be able to rank good candidates over bad ones;

  • (P2) Learn from small samples: as the real performance of each point in the search space is expensive to acquire, the complexity of the predictor should be low so that it can learn from few samples.

Specifically, Principle (P1) means that we need to extract features which reflect the unique properties of the search space; and Principle (P2) indicates that we cannot extract too many features, and that simple learning models should be used with the extracted features. Therefore, designing a good predictor is not a trivial task.

4.3.2. Symmetry-related features (SRFs).

Based on Principle (P1), the features extracted from g(r) in Definition 3.1 should be closely related to the quality of the defined SF. Meanwhile, the features should be cheap to construct, i.e., they should not depend on the values of r, which are unknown at this stage. For (P2), the number of features should be limited in order to guarantee a simple predictor. A quick solution to transform g(r) into features is one-hot encoding. However, such a direct solution is far from effective for KGE, since the one-hot representation does not take the domain-specific Constraint (C1) into consideration. Besides, it is not sample-efficient, as one-hot features require a complex predictor.

Therefore, we are motivated to design symmetry-related features (SRFs), which efficiently capture to what extent g(r) can be symmetric or skew-symmetric, and thus correlate well with the performance while being invariant to permutation and sign flipping. The SRFs we design are inspired by the following key observation on g(r). Looking back at the example in Figure 3: if the off-diagonal blocks can match under proper r, then g(r) = g(r)^T, which means such g(r) can be symmetric; if they can match with opposite signs under proper r, then g(r) = −g(r)^T, which means such g(r) can be skew-symmetric. Based on Proposition 1, the SF defined by Figure 2(a) is then a good candidate. However, how to obtain r with appropriate values to check the symmetry properties of g(r) is still a problem. Fortunately, the general structure of g(r) can be abstracted as a very small 4×4 matrix when each r_i reduces to being 1-dimensional. Thus, we can directly assign a proper scalar value to each r_i as in Figure 4. Then, letting A = g(r), the symmetric and skew-symmetric properties of g(r) can be efficiently checked through A = A^T and A = −A^T, since A here is a simple 4×4 matrix.

Definition 4.1 (SRF).

Let the 1-dimensional degenerations of r_1, r_2, r_3, r_4 be scalars a, b, c, d. We give them the following assignments:

  • (S1) All of them have the same absolute value, like (1, 1, 1, 1);

  • (S2) Three of them have the same absolute value while the other one does not, like (1, 1, 1, 2);

  • (S3) Two of them have the same absolute value while the other two are mutually different, like (1, 1, 2, 3);

  • (S4) Two of them have the same absolute value, and the other two share another absolute value, like (1, 1, 2, 2);

  • (S5) All of them have different absolute values, like (1, 2, 3, 4).

For each of (S1)-(S5), a group of assignments is generated by permuting the values and flipping their signs. After removing equivalent assignments, only 45 different groups are enough, so the complexity is low. Based on the assigned matrices, the symmetric and skew-symmetric properties of g(r) under each of (S1)-(S5) can be quickly checked, leading to a compact SRF vector (details in the appendix).

An example of an assignment under (S3) is shown in Figure 4. We first use scalars to replace the blocks of g(r), and then assign them values according to (S3). Finally, the matrix in Figure 4 is checked for whether it can possibly handle symmetric relations. Meanwhile, the invariance and expressiveness properties of the SRFs are guaranteed by the following Proposition 2. Thus, SRFs can theoretically address Constraint (C1) as well as the problem of permutation and sign flipping.
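The SRF check can be sketched by instantiating the structure matrix with a scalar assignment and testing symmetry numerically; the assignment groups passed in and the feature layout below are illustrative, not the paper's exact 45 groups:

```python
import numpy as np

def srf_features(A, assignments):
    """Sketch of SRF extraction: instantiate the 4x4 integer structure matrix
    A with scalar values for (r_1, ..., r_4), then record whether the
    resulting matrix can be symmetric / skew-symmetric."""
    feats = []
    for vals in assignments:
        M = np.zeros((4, 4))
        for i in range(4):
            for j in range(4):
                a = A[i][j]
                if a != 0:
                    # entry a selects the scalar vals[|a|-1] with sign(a)
                    M[i, j] = np.sign(a) * vals[abs(a) - 1]
        feats.append((float(np.allclose(M, M.T)),      # can be symmetric?
                      float(np.allclose(M, -M.T))))    # can be skew-symmetric?
    return feats
```

Because the checks only use absolute-value patterns of the assignments, permuting the r_i's or flipping their signs leaves the resulting features unchanged, matching the invariance requirement.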

Figure 4. Example of generating one SRF under (S3).
Proposition 2.

The SRFs extracted from (S1)-(S5) are (i) invariant to both permutation and sign flipping of the blocks, and (ii) give a low prediction if the SF can be neither symmetric nor skew-symmetric.

WN18 FB15k WN18RR FB15k237 YAGO3-10 average ranking
type model MRR Hit@10 MRR Hit@10 MRR Hit@10 MRR Hit@10 MRR Hit@10
TDM TransE* 0.5001 94.13 0.4951 77.37 0.1784 45.09 0.2556 41.89 - - -
TransH* 0.5206 94.52 0.4518 76.55 0.1862 45.09 0.2329 40.10 - - -
TransD* 0.5093 94.61 0.4529 76.55 0.1904 46.41 0.2451 42.89 - - -
NNM NTN* 0.53 66.1 0.25 41.4 - - - - - - -
ConvE* 0.943 95.6 0.657 85.4 0.43 52.0 0.325 50.1 0.44 62.0 4.8
BLM DistMult 0.8461 94.62 0.7389 88.31 0.4307 48.60 0.3056 50.73 0.4707 61.53 4.6
ComplEx 0.9453 95.06 0.7429 88.73 0.4438 49.27 0.3088 50.92 0.4593 60.51 3.8
Analogy 0.9450 95.06 0.7469 88.71 0.4468 49.51 0.3081 50.99 0.4705 61.31 4
SimplE 0.9456 95.04 0.7589 88.23 0.4526 51.12 0.3024 50.37 0.4681 62.01 3
AutoKGE 0.9456 95.13 0.7619 88.15 0.4574 51.75 0.3111 50.99 0.4777 61.58 1.2
Table 3. Comparison of the best SF identified by AutoKGE with the state-of-the-art. Bold numbers indicate the best performance and underlined numbers the second best. Results of TransE*, TransH* and TransD* are taken from the re-implementation in (Zhang et al., 2018); NTN* from (Yang et al., 2015); ConvE* from (Dettmers et al., 2017). The average ranking is computed from the rank of the MRR performance among ConvE*, the BLMs and AutoKGE on all datasets.

Finally, we use a simple two-layer feed-forward neural network (10-2-1) as the learning model. The network is trained on the SRFs with stochastic gradient descent. In Section 5.4, we show that this predictor correlates well with the actual performance (P1) and learns quickly from a small number of samples (P2).
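A minimal PyTorch sketch of such a 10-2-1 predictor follows. The activation, loss, learning rate, and epoch count are assumptions for illustration; only the layer sizes and the use of SGD come from the text:

```python
import torch
import torch.nn as nn

# 10-d SRF input -> 2 hidden units -> 1 output (estimated performance).
predictor = nn.Sequential(
    nn.Linear(10, 2),
    nn.ReLU(),       # assumed activation
    nn.Linear(2, 1),
)

def train_predictor(srfs, perfs, epochs=200, lr=0.01):
    """srfs: (n, 10) SRF features of already-trained SFs;
    perfs: (n,) their observed validation performance (e.g. MRR)."""
    opt = torch.optim.SGD(predictor.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # assumed regression loss
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(predictor(srfs).squeeze(-1), perfs)
        loss.backward()
        opt.step()
    return predictor
```

With only tens of parameters, such a network can plausibly be fit on the small number of (SF, performance) pairs collected during the search.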

4.3.3. Discussion: difference with PNAS.

The most related work in the AutoML literature is progressive neural architecture search (PNAS) (Liu et al., 2018), which proposes a greedy algorithm to search for a cell structure that is used to build a convolutional neural network (CNN). PNAS is also supported by a predictor, which helps avoid training candidates in each greedy step. However, while AutoKGE and PNAS are both built on the combination of a greedy search and a predictor, the differences are significant. First, the main concerns during the search are different: PNAS focuses on how to build a topological structure within a cell, whereas we need to deal with the domain-specific Constraints (C1)-(C4) in Section 3.2 and the invariance and expressiveness requirements in Section 4.1. Thus, the search strategy cannot be transferred from PNAS to AutoKGE. Second, PNAS adopts a direct one-hot encoding for the predictor, which performs poorly here, as explained in Section 4.3.2. Both the greedy method in Algorithm 3 and the SRF-based predictor are specifically designed around important properties of KGE. These make AutoKGE efficient and effective in searching for good SFs.

5. Empirical Study

In this section, we first set up experiments in Section 5.1. Then, in Section 5.2, we show AutoKGE’s ability in finding SFs of which performance can beat human-designed ones. The efficiency of AutoKGE is presented in Section 5.3 by comparing with random search schemes. Finally, we show the effectiveness of the designed SRFs for accurate performance estimation in Section 5.4.

5.1. Experiment Setup

Five datasets, i.e., WN18, FB15k, WN18RR, FB15k237 and YAGO3-10, are considered (Table 4). Specifically, WN18RR and FB15k237 are variants that remove near-duplicate or inverse-duplicate relations from WN18 and FB15k, respectively. YAGO3-10 is much larger than the others. These benchmark datasets are popularly used in the literature (Bordes et al., 2013; Yang et al., 2015; Trouillon et al., 2017; Liu et al., 2017; Kazemi and Poole, 2018; Zhang et al., 2018; Lacroix et al., 2018).

dataset #entity #relation #train #valid #test
WN18 (Bordes et al., 2013) 40,943 18 141,442 5,000 5,000
FB15k (Bordes et al., 2013) 14,951 1,345 484,142 50,000 59,071
WN18RR (Dettmers et al., 2017) 40,943 11 86,835 3,034 3,134
FB15k237 (Toutanova and Chen, 2015) 14,541 237 272,115 17,535 20,466
YAGO3-10 (Mahdisoltani et al., 2013) 123,188 37 1,079,040 5,000 5,000
Table 4. Statistics of the datasets used in experiments.

Following (Yang et al., 2015; Trouillon et al., 2017; Liu et al., 2017; Kazemi and Poole, 2018; Dettmers et al., 2017), we test KGE's performance on the link prediction task. For each triplet in the validation or test set, we replace the head entity with every entity in the KG, compute the resulting scores, and record the rank of the correct head; the same is done for the tail entity. This is the standard testbed for measuring KGE models. As in the papers above, we adopt the following metrics:


  • Mean reciprocal rank (MRR): the average of the reciprocal ranks over all ranking results;

  • Hit@10: the percentage of correct entities ranked within the top 10.

MRR and Hit@10 measure the top rankings of the positive entity to different degrees, and larger values indicate better performance. To avoid underestimating the performance of the models, we report performance in the "filtered" setting, i.e., all corrupted triplets that exist in the train, validation or test sets are filtered out (Bordes et al., 2013; Wang et al., 2014). All algorithms are implemented in Python with the PyTorch framework (Paszke et al., 2017) and run on 8 TITAN Xp GPUs.
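The filtered evaluation protocol can be sketched as below. The toy scores and index sets are made up for illustration; in practice the scores come from the trained KGE model:

```python
import numpy as np

def filtered_rank(scores, true_idx, known_idx):
    """Rank of the true entity among all entities, with every other
    entity that forms a known (train/valid/test) triplet filtered out."""
    s = scores.copy()
    s[[i for i in known_idx if i != true_idx]] = -np.inf  # "filtered" setting
    return int((s > s[true_idx]).sum()) + 1               # 1-based rank

def mrr_hit10(ranks):
    """MRR and Hit@10 (in percent) over a list of ranks."""
    ranks = np.asarray(ranks, dtype=float)
    return (1.0 / ranks).mean(), 100.0 * (ranks <= 10).mean()

# Toy query over 5 entities: entity 2 is correct, entity 0 also forms a
# known triplet and is filtered out, so only entity 3 (score 7) outranks it.
r = filtered_rank(np.array([9., 1., 5., 7., 0.]), true_idx=2, known_idx=[0])
print(r)  # -> 2
```

Aggregating `filtered_rank` over all head- and tail-corrupted queries and feeding the ranks to `mrr_hit10` yields the two reported metrics.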

# of SFs WN18 FB15k WN18RR FB15k237 YAGO3-10
trained MRR Hit@10 MRR Hit@10 MRR Hit@10 MRR Hit@10 MRR Hit@10
24 0.9456 95.04 0.7589 88.23 0.4545 51.29 0.3065 51.27 0.4731 61.38
280 0.9456 95.13 0.7619 88.15 0.4574 51.75 0.3111 50.99 0.4777 61.58
536 0.9456 95.13 0.7619 88.15 0.4574 51.75 0.3111 50.99 0.4777 61.58
792 0.9456 95.13 0.7619 88.15 0.4574 51.75 0.3111 50.99 0.4777 61.58
Table 5. Best performance of the SFs identified by AutoKGE as the number of trained SFs grows (256 models are trained per greedy step).

5.2. Comparison with State-of-the-arts

In this section, we compare AutoKGE with the representative and state-of-the-art KGE models discussed in Section 2.2, whose SFs are designed by humans: TransE (Bordes et al., 2013), TransH (Wang et al., 2014) and TransD (Ji et al., 2015) from the TDMs; NTN (Socher et al., 2013) and ConvE (Dettmers et al., 2017) from the NNMs; DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2017), Analogy (Liu et al., 2017) and SimplE (Kazemi and Poole, 2018) from the BLMs. The embedding dimension is uniformly divided into four parts to support the search of AutoKGE. Besides, we use grid search to select the following common hyper-parameters for all models: the learning rate for the Adam (Kingma and Ba, 2014) and Adagrad (Duchi et al., 2011) optimizers; the batch size; the dropout rate, which we use as a replacement for the L2 penalty to reduce training time, as in (Dettmers et al., 2017); and the number of negative samples. The hyper-parameters are selected by the MRR value on the validation set.
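As a minimal sketch of this selection step: the grid values below are hypothetical placeholders (the actual candidate values are not reproduced here), and `train_and_eval` stands in for a full KGE training run returning the validation MRR:

```python
from itertools import product

# Hypothetical candidate grids; the paper's actual values differ.
grid = {
    "lr": [0.1, 0.01],
    "batch_size": [256, 512],
    "dropout": [0.0, 0.2],
    "n_negative": [16, 32],
}

def grid_search(train_and_eval):
    """train_and_eval(config) -> validation MRR; returns the best config."""
    best_cfg, best_mrr = None, -1.0
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        mrr = train_and_eval(cfg)
        if mrr > best_mrr:
            best_cfg, best_mrr = cfg, mrr
    return best_cfg, best_mrr
```

Each baseline and each searched SF goes through the same grid, so the comparison in Table 3 is not confounded by hyper-parameter tuning.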

5.2.1. Effectiveness of the searched SFs.

The testing performance of AutoKGE and the state-of-the-art models is compared in Table 3. First, we can see that there is no absolute winner among the baseline SFs. For example, ConvE is the best on FB15k237 but only fifth among human-designed SFs on FB15k; SimplE adapts well to WN18, FB15k and WN18RR, but performs worse than the other BLMs on FB15k237. In contrast, AutoKGE performs consistently well across all five datasets: best on WN18, FB15k, WN18RR and YAGO3-10, and runner-up on FB15k237. This demonstrates the effectiveness of AutoKGE.

5.2.2. Distinctiveness of the searched SFs.

Figure 5. A graphical illustration of the SFs identified by AutoKGE on each dataset: (a) WN18; (b) FB15k; (c) WN18RR; (d) FB15k237; (e) YAGO3-10.

To show that the searched SFs are KG-dependent and novel to the literature, we plot them in Figure 5. As shown, these SFs differ from each other and are not equivalent under permutation or sign flipping. To further demonstrate that the searched SFs are KG-dependent, we take the best SF searched on one dataset and test its performance on the others; the results are shown in Table 6. The SFs achieve their best performance on the datasets where they were searched, which clearly shows that the SFs identified by AutoKGE on different KGs are distinct from each other.

WN18 FB15k WN18RR FB15k237 YAGO3-10
WN18 0.9456 0.7439 0.4415 0.3012 0.4721
FB15k 0.9338 0.7619 0.4223 0.3013 0.4651
WN18RR 0.9448 0.7378 0.4574 0.3054 0.4606
FB15k237 0.8408 0.7154 0.4392 0.3111 0.4607
YAGO3-10 0.8774 0.6968 0.4299 0.2996 0.4777
Table 6. MRRs of applying SF searched from one dataset (indicated by a row) on another dataset (indicated by a column).

5.2.3. Efficiency of the search algorithm

Finally, Table 5 shows the best performance of the searched SFs as the number of trained SFs increases. As indicated by the bold numbers, the best SFs are all visited within only 280 trials of training and evaluation. As the search space grows larger, it becomes more challenging for the greedy algorithm to explore it effectively. However, as shown in Section 5.2.1, the identified SFs already outperform the human-designed ones.

WN18 FB15k WN18RR FB15k237 YAGO3-10
step 9 390.5±10.3 483.9±5.7 334.5±15.4 303.9±3.1 1014.2±11.2
step 12 1.7±0.1 2.1±0.1 2.3±0.2 1.8±0.1 2.1±0.2
Table 7. Running time (min) per greedy step (steps 6-13 in Algorithm 3). Except for step 9 (model training and evaluation) and step 12 (computing SRFs and training the predictor), all other steps take less than 0.1 minutes.

Table 7 reports the running time of the different components of AutoKGE. First, the SRF-based predictor is very efficient and takes far less time than model training. As indicated by Table 5, the best SFs can be found within several hours (on 8 GPUs), with 256 models trained per greedy step. In comparison, search based on reinforcement learning (Zoph and Le, 2017) runs for over 4 days on 500 GPUs; genetic programming (Xie and Yuille, 2017) takes 17 days on a single GPU; and Bayesian optimization (Feurer et al., 2015) trains for several days on CPUs. The proposed AutoKGE makes the search problem on KGE tractable through progressive greedy search and the use of the predictor.

Figure 6. Comparison of AutoKGE with random search on search efficiency: (a) WN18RR; (b) FB15k237.

5.3. Comparison with Random Search

In this part, we demonstrate the efficiency of AutoKGE by comparing it with random search. We use WN18RR and FB15k237 here, and to save running time, we use a smaller embedding dimension during the search procedure, with the hyper-parameters of Algorithm 3 fixed. We compare AutoKGE with the following methods:


  • Greedy+random: same as Algorithm 3, but the top candidates in each greedy step are randomly selected rather than chosen by the predictor;

  • Absolute Random: SFs are generated directly at random.

Note that, as discussed in Section 4.1, other popular search algorithms, e.g., reinforcement learning (Zoph and Le, 2017; Baker et al., 2017), Bayesian optimization (Feurer et al., 2015) and genetic programming (Xie and Yuille, 2017), are not applicable and thus not compared.

We use the average MRR of the top-ranked searched models for evaluation. The comparison among these algorithms is shown in Figure 6. First, the average MRR of the top models gradually increases for all methods as more models are searched. However, the speed of increase is quite different. As discussed in Section 4, the search challenges are addressed by two components: the greedy search algorithm and the carefully designed predictor. Specifically, Greedy+random is faster than Absolute Random, since the greedy steps effectively prune the search space; AutoKGE is the fastest and outperforms Greedy+random thanks to the predictor.
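The two selection strategies differ only in how the surviving candidates of each greedy step are chosen. A minimal sketch, with `predictor_score` as a stand-in for the trained predictor:

```python
import random

def greedy_step(candidates, predictor_score, topk, use_predictor=True):
    """One candidate-selection step: keep the top-k candidates either by
    predicted performance (AutoKGE) or at random (Greedy+random).
    predictor_score maps a candidate SF to its estimated performance."""
    if use_predictor:
        ranked = sorted(candidates, key=predictor_score, reverse=True)
        return ranked[:topk]
    return random.sample(candidates, min(topk, len(candidates)))
```

Only the selected candidates are actually trained and evaluated, which is why a predictor that ranks candidates well directly translates into faster search.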

5.4. Effectiveness of the SRF based Predictor

Finally, we show the effectiveness of the predictor. The following methods are compared with AutoKGE (which uses the designed SRFs):


  • AutoKGE+onehot: same as Algorithm 3, but the predictor is built on one-hot encodings of the candidate SFs;

  • AutoKGE+random: same as Algorithm 3, but the top candidates are randomly selected rather than chosen by the predictor.

Figure 7. Comparison of AutoKGE with predictors using SRF and one-hot features: (a) WN18RR; (b) FB15k237.

Again, the average MRR of the top-ranked models is used for evaluation. The results are plotted in Figure 7. First, it is clear that AutoKGE with SRFs is the best in both effectiveness and speed of convergence. Then, AutoKGE+onehot improves upon AutoKGE+random only slightly, since a predictor trained on one-hot features is not as powerful as one trained on SRFs. This demonstrates the need for carefully designed features for the predictor. These findings consistently show the effectiveness of the designed SRFs.

6. Conclusion

In this paper, we propose AutoKGE, an algorithm to design and discover distinct SFs for KGE automatically. By using a progressive greedy search algorithm and a predictor with domain-specific features, our method discovers promising SFs, which are KG-dependent, new to the literature, and outperform state-of-the-art SFs designed by humans, within only 280 model trainings selected from a huge search space. In future work, we will further improve the greedy algorithm to better explore larger search spaces. The multi-class loss and the regularizer introduced in (Lacroix et al., 2018) will also be included to fully exploit the power of BLMs. Finally, it will be interesting to search for SFs that are relation-dependent.


  • Auer et al. (2007) S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. 2007. Dbpedia: A nucleus for a web of open data. In The semantic web. Springer, 722–735.
  • Baker et al. (2017) B. Baker, O. Gupta, N. Naik, and R. Raskar. 2017. Designing neural network architectures using reinforcement learning. In ICLR.
  • Bengio et al. (2013) Y. Bengio, A. Courville, and P. Vincent. 2013. Representation learning: A review and new perspectives. TPAMI 35, 8 (2013), 1798–1828.
  • Bollacker et al. (2008) K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD. 1247–1250.
  • Bordes et al. (2013) A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NIPS. 2787–2795.
  • Dettmers et al. (2017) T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. 2017. Convolutional 2D knowledge graph embeddings. In AAAI.
  • Dong et al. (2014) X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. 2014. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In SIGKDD. 601–610.
  • Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR 12, Jul (2011), 2121–2159.
  • Eggensperger et al. (2015) K. Eggensperger, F. Hutter, H. Hoos, and K. Leyton-Brown. 2015. Efficient benchmarking of hyperparameter optimizers via surrogates. In AAAI.

  • Fan et al. (2014) M. Fan, Q. Zhou, E. Chang, and T. F. Zheng. 2014. Transition-based knowledge graph embedding with relational mapping properties. In PACLIC.
  • Feurer et al. (2015) M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. 2015. Efficient and robust automated machine learning. In NIPS. 2962–2970.
  • Getoor and Taskar (2007) L. Getoor and B. Taskar. 2007. Introduction to statistical relational learning. Vol. 1. The MIT Press.
  • Hayashi and Shimbo (2017) K. Hayashi and M. Shimbo. 2017. On the equivalence of holographic and complex embeddings for link prediction. In ACL, Vol. 2. 554–559.
  • Ji et al. (2015) G. Ji, S. He, L. Xu, K. Liu, and J. Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In ACL, Vol. 1. 687–696.
  • Kazemi and Poole (2018) M. Kazemi and D. Poole. 2018. SimplE embedding for link prediction in knowledge graphs. In NeurIPS.
  • Kingma and Ba (2014) D. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. Technical Report. arXiv:1412.6980.
  • Lacroix et al. (2018) T. Lacroix, N. Usunier, and G. Obozinski. 2018. Canonical tensor decomposition for knowledge base completion. In ICML.

  • Lao et al. (2011) N. Lao, T. Mitchell, and W. W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In EMNLP. Association for Computational Linguistics, 529–539.
  • Lin et al. (2018) Y. Lin, X. Han, R. Xie, Z. Liu, and M. Sun. 2018. Knowledge representation learning: A quantitative review. Technical Report. arXiv:1812.10901.
  • Lin et al. (2015) Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In AAAI, Vol. 15. 2181–2187.
  • Liu et al. (2018) C. Liu, B. Zoph, S. Jonathon, W. Hua, L. Li, F.-F. Li, A. Yuille, J. Huang, and K. Murphy. 2018. Progressive neural architecture search. In ECCV.
  • Liu et al. (2017) H. Liu, Y. Wu, and Y. Yang. 2017. Analogical inference for multi-relational embeddings. In ICML. 2168–2178.
  • Lukovnikov et al. (2017) D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer. 2017. Neural network-based question answering over knowledge graphs on word and character level. In WWW. 1211–1220.
  • Mahdisoltani et al. (2013) F. Mahdisoltani, J. Biega, and F. M Suchanek. 2013. Yago3: A knowledge base from multilingual wikipedias. In CIDR.
  • Miller (1995) G. A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
  • Nickel et al. (2016a) M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. 2016a. A review of relational machine learning for knowledge graphs. Proc. IEEE 104, 1 (2016), 11–33.
  • Nickel et al. (2016b) M. Nickel, L. Rosasco, and T. Poggio. 2016b. Holographic embeddings of knowledge graphs. In AAAI. 1955–1961.
  • Nickel et al. (2011) M. Nickel, V. Tresp, and H. Kriegel. 2011. A three-way model for collective learning on multi-relational data. In ICML, Vol. 11. 809–816.
  • Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. 2017. Automatic differentiation in PyTorch. In ICLR.
  • Paulheim (2017) H. Paulheim. 2017. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic web 8, 3 (2017), 489–508.
  • Singhal (2012) A. Singhal. 2012. Introducing the knowledge graph: Things, not strings. Official Google blog 5 (2012).
  • Socher et al. (2013) R. Socher, D. Chen, C. Manning, and A. Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS. 926–934.
  • Suchanek et al. (2007) F. Suchanek, G. Kasneci, and G. Weikum. 2007. Yago: A core of semantic knowledge. In WWW. 697–706.
  • Toutanova and Chen (2015) K. Toutanova and D. Chen. 2015. Observed versus latent features for knowledge base and text inference. In Workshop on Continuous Vector Space Models and their Compositionality. 57–66.
  • Trouillon et al. (2017) T. Trouillon, C. Dance, E. Gaussier, J. Welbl, S. Riedel, and G. Bouchard. 2017. Knowledge graph completion via complex tensor factorization. JMLR 18, 1 (2017), 4735–4772.
  • Wang et al. (2017b) Q. Wang, Z. Mao, B. Wang, and L. Guo. 2017b. Knowledge graph embedding: A survey of approaches and applications. TKDE 29, 12 (2017), 2724–2743.
  • Wang et al. (2017a) Y. Wang, R. Gemulla, and H. Li. 2017a. On multi-relational link prediction with bilinear models. In AAAI.
  • Wang et al. (2014) Z. Wang, J. Zhang, J. Feng, and Z. Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI, Vol. 14. 1112–1119.
  • Xie and Yuille (2017) L. Xie and A. Yuille. 2017. Genetic CNN. In ICCV. 1388–1397.
  • Xue et al. (2018) Y. Xue, Y. Yuan, Z. Xu, and A. Sabharwal. 2018. Expanding holographic embeddings for knowledge completion. In NeurIPS. 4496–4506.
  • Yang et al. (2015) B. Yang, W. Yih, X. He, J. Gao, and L. Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In ICLR.
  • Yao and Wang (2018) Q. Yao and M. Wang. 2018. Taking human out of learning applications: A survey on automated machine learning. Technical Report. Arxiv: 1810.13306.
  • Zhang et al. (2016) F. Zhang, N. Jing Yuan, D. Lian, X. Xie, and W.-Y. Ma. 2016. Collaborative knowledge base embedding for recommender systems. In SIGKDD. 353–362.
  • Zhang et al. (2018) Y. Zhang, Q. Yao, Y. Shao, and L. Chen. 2018. NSCaching: simple and efficient negative sampling for knowledge graph embedding. Technical Report. arXiv:1812.06410.
  • Zoph and Le (2017) B. Zoph and Q. Le. 2017. Neural architecture search with reinforcement learning. In ICLR.