Efficient Parallel Translating Embedding For Knowledge Graphs

03/30/2017 ∙ by Denghui Zhang, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences

Knowledge graph embedding aims to embed entities and relations of knowledge graphs into low-dimensional vector spaces. Translating embedding methods regard relations as translations from head entities to tail entities and achieve state-of-the-art results among knowledge graph embedding methods. However, a major limitation of these methods is the time-consuming training process, which may take several days or even weeks for large knowledge graphs and makes practical application difficult. In this paper, we propose an efficient parallel framework for translating embedding methods, called ParTrans-X, which enables these methods to be parallelized without locks by exploiting the distinctive structure of knowledge graphs. Experiments on two datasets with three typical translating embedding methods, i.e., TransE [3], TransH [17], and a more efficient variant TransE-AdaGrad [10], validate that ParTrans-X can speed up the training process by more than an order of magnitude.


1. Introduction

Knowledge graphs are structured graphs with entities as nodes and relations as edges. They are usually represented as RDF-style triples $(h, r, t)$, where $h$ represents a head entity, $t$ a tail entity, and $r$ the relation between them. In the past decades, many large-scale knowledge graphs have emerged, e.g., Freebase (Bollacker et al., 2008), WordNet (Miller, 1995), YAGO (Mahdisoltani et al., 2014), OpenKN (Jia et al., 2014), and have played a pivotal role in supporting many applications, such as link prediction and question answering. Although these knowledge graphs are very large, usually containing thousands of relation types, millions of entities and billions of triples, they are still far from complete. As a result, knowledge graph completion (KGC), which mainly aims to predict missing relations between entities under the supervision of existing triples, has received much attention.

Recent years have witnessed great advances in translating embedding methods for tackling the KGC problem. These methods represent entities and relations as embedding vectors by regarding relations as translations from head entities to tail entities, e.g., TransE (Bordes et al., 2013), TransH (Wang et al., 2014) and TransR (Lin et al., 2015b). However, their training procedure is time consuming, since they all employ stochastic gradient descent (SGD) to optimize a translation-based loss function, which may require days to converge for large knowledge graphs.

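| Model | Time complexity | Model complexity | Time on FB15k (per epoch / total) | Time on Freebase-rdf-latest (per epoch / total) | Time on the whole Freebase (per epoch / total) |
|---|---|---|---|---|---|
| TransE | | | 4.5 s / 4658 s (78 minutes) | 17,696 s (5 hours) / 18,323,395 s (212 days) | 29,781 s (8 hours) / 30,828,893 s (357 days) |
| TransH | | | 6 s / 100 minutes | 6.5 hours / 273 days | 11 hours / 459 days |
| TransR | | | 473 s / 5 days | 21.5 days / 59 years | 36 days / 99 years |

Table 1. Complexity analysis of typical translating embedding methods, with embedding dimension 100 and 1000 epochs.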

For instance, Table 1 shows the complexity of typical translating embedding methods, reporting the total training time together with the time of each epoch, where one epoch is a single pass over all triples. (The experiments are conducted on a machine with dual Intel Xeon E5-2640 CPUs, 10 cores each with 2 hyperthreads, running at 2.4 GHz, and 128 GB of RAM. The kernel is Red Hat 4.4.7.)

The complexity depends on the numbers of entities, relations and triples in the knowledge graph, on the embedding dimension, which is the same for entities and relations in this case, and on the minimum number of epochs. It can be seen that the time complexity of TransE is proportional to the number of triples, the embedding dimension and the number of epochs. When the dimension is 100 and the number of epochs is 1000, it takes 78 minutes for TransE to learn the embeddings of FB15k (https://everest.hds.utc.fr/lib/exe/fetch.php?media=en:fb15k.tgz), which is a subset of Freebase with 483,142 training triples and has been widely used as an experimental dataset for knowledge graph embedding methods (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015b; Jia et al., 2016). By contrast, Freebase-rdf-latest (http://commondatastorage.googleapis.com/freebase-public/), the latest publicly available data dump of Freebase, contains 1.9 billion triples, which results in approximately 3932 times the training time, namely, 212 days. Furthermore, the whole Freebase contains over 3 billion triples (there were 3,197,653,841 triples in Freebase on May 2, 2016; see https://github.com/nchah/freebase-triples), and it would take about 357 days to learn its embeddings. Despite its large size, Freebase still suffers from incompleteness, e.g., 75% of persons do not have nationalities in Freebase (Dong et al., 2014). On top of that, most improved variants of TransE employ more complex loss functions to better train the embedding vectors; they thus have higher time complexity or model complexity, and their training time becomes even less bearable. For example, it would take more than 59 years to train on Freebase-rdf-latest with TransR, one of the typical improved variants, which achieves far better performance than TransE.
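The extrapolated figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that training time grows linearly with the number of training triples; the function and constant names are illustrative and not part of the original work.

```python
# Extrapolate the measured FB15k training time of TransE to larger graphs,
# assuming (for this sketch) that time scales linearly with the triple count.
FB15K_TRIPLES = 483_142        # training triples in FB15k
FB15K_TOTAL_SECONDS = 4658     # total TransE training time on FB15k (Table 1)
SECONDS_PER_DAY = 86_400


def extrapolate_days(num_triples: int) -> float:
    """Scale the FB15k training time to a graph with num_triples triples."""
    ratio = num_triples / FB15K_TRIPLES
    return FB15K_TOTAL_SECONDS * ratio / SECONDS_PER_DAY


freebase_rdf_latest_days = extrapolate_days(1_900_000_000)
whole_freebase_days = extrapolate_days(3_197_653_841)
print(f"Freebase-rdf-latest: about {freebase_rdf_latest_days:.0f} days")
print(f"Whole Freebase: about {whole_freebase_days:.0f} days")
```

Under this linear assumption the two estimates round to 212 and 357 days, matching the totals reported in Table 1.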

There have been attempts to resolve the efficiency issue of translating embedding methods for knowledge graphs. Minervini et al. (2015) proposed TransE-AdaGrad to speed up the training process by leveraging adaptive learning rates. However, TransE-AdaGrad essentially only reduces the number of epochs needed to converge, and still cannot cope with large-scale knowledge graphs. In fact, with more and more computation resources available, it is natural and more effective to parallelize these embedding methods, which leads to significant improvement in training efficiency and can scale to quite large knowledge graphs given sufficient hardware resources.

However, it is challenging to parallelize translating embedding methods, since their training processes mainly employ the stochastic gradient descent algorithm (SGD) or its variants. SGD is inherently sequential, as each iteration depends on the previous one. Parallelizing translating embedding methods straightforwardly will result in collisions between different processors. For instance, an entity embedding vector may be updated by two processors at the same time with different gradients; such conflicting updates are called collisions. To avoid collisions, some methods (Langford et al., 2009) lock embedding vectors, which slows the training process greatly as there are so many vectors. In contrast, updating vectors without locks leads to high efficiency, but must rest on specific assumptions (Recht et al., 2011; Dean et al., 2012), since a lock-free training process may converge poorly if a suboptimal strategy is adopted to resolve collisions.

Our key observation about translating embedding methods is that the update performed in one iteration of SGD is based on only one triple and its corrupted sample, and does not necessarily involve other embedding vectors. This gives us the chance to learn the embedding vectors in parallel without locks. In this article, we analyze the distinctive data structure of knowledge graphs and propose an efficient parallel framework for translating embedding methods, called ParTrans-X. It enables translating methods to update the embedding vectors efficiently in shared memory without locks. Thus the training process is greatly sped up with multiple processors, and can be more than an order of magnitude faster without lowering learning quality.

The contributions of this article are:

1. We explore the law of collisions along with increasing number of processors, by modelling the training data of knowledge graph into hypergraphs.

2. We propose ParTrans-X framework to train translating methods efficiently in parallel. It utilizes the training data sparsity of large scale knowledge graphs, and can be easily applied to many translating embedding methods.

3. We apply ParTrans-X to typical translating embedding methods, i.e., TransE (Bordes et al., 2013), TransH (Wang et al., 2014), and a more efficient variant TransE-AdaGrad, and experiments validate the effectiveness of ParTrans-X on two widely used datasets.

The paper is organized as follows. Related work is in Sec.2. The collision formulation is introduced in Sec.3 and ParTrans-X is proposed based on it in Sec.4. Then, experiments demonstrate training efficiency of ParTrans-X in Sec.5, with conclusions in Sec.6.

2. Related Work

In recent years, translating embedding methods have played a pivotal role in knowledge graph completion. They usually employ the stochastic gradient descent algorithm to optimize a translation-based loss function, i.e.,

$$L = \sum_{(h,r,t) \in S} \; \sum_{(h',r,t') \in S'} \left[ \gamma + f_r(h,t) - f_r(h',t') \right]_+ \qquad (1)$$

where $(h,r,t)$ represents a positive triple that exists in the knowledge graph, while $(h',r,t')$ stands for a negative triple that is not in the knowledge graph. $[x]_+ = \max(0, x)$ is the hinge loss, and $\gamma$ is the margin between positive and negative triples. $f_r(h,t)$ is the score function that determines whether the triple should exist in the knowledge graph, and it differs among translating embedding methods.

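A significant work is TransE (Bordes et al., 2013), which heralds the start of translating embedding methods. It regards a triple $(h,r,t)$ as a translation from the head entity $h$ to the tail entity $t$, i.e., $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$, and the score function is $f_r(h,t) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_{\ell_{1/2}}$, where $\|\cdot\|_{\ell_{1/2}}$ represents L1-similarity or L2-similarity. The boldface denotes vectors in the embedding space, namely, $\mathbf{h}, \mathbf{t} \in \mathbb{R}^{d_e}$ and $\mathbf{r} \in \mathbb{R}^{d_r}$, where $d$ is the dimension of the embedding space, with $d_e$ the dimension for entities and $d_r$ for relations. Moreover, TransH (Wang et al., 2014) assumes that it is the projections of entities onto a relation-specific hyperplane that satisfy the translation constraint, i.e., $\mathbf{h}_\perp + \mathbf{r} \approx \mathbf{t}_\perp$, where $\mathbf{h}_\perp = \mathbf{h} - \mathbf{w}_r^\top \mathbf{h}\,\mathbf{w}_r$ and $\mathbf{t}_\perp = \mathbf{t} - \mathbf{w}_r^\top \mathbf{t}\,\mathbf{w}_r$, with $\mathbf{w}_r$ as the normal vector of the hyperplane related to $r$. Furthermore, TransR (Lin et al., 2015b) employs a rotation transformation to project the entities into a relation-specific space, i.e., $\mathbf{h}_r + \mathbf{r} \approx \mathbf{t}_r$, where $\mathbf{h}_r = \mathbf{M}_r \mathbf{h}$ and $\mathbf{t}_r = \mathbf{M}_r \mathbf{t}$, and $\mathbf{M}_r$ is the projection matrix related to $r$. Some works also incorporate more information to improve the embeddings, e.g., paths (Lin et al., 2015a) and margins (Jia et al., 2016).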
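The score functions and the hinge loss described above can be illustrated in a few lines. This is a minimal sketch, assuming the L1 dissimilarity and plain Python lists as vectors; the function names are hypothetical and chosen only for this example.

```python
def l1_norm(vec):
    """Sum of absolute values of the components."""
    return sum(abs(x) for x in vec)


def transe_score(h, r, t):
    """TransE score ||h + r - t||_1; lower means a more plausible triple."""
    return l1_norm([hi + ri - ti for hi, ri, ti in zip(h, r, t)])


def project_to_hyperplane(e, w):
    """Project e onto the hyperplane with unit normal w: e - (w . e) w."""
    dot = sum(wi * ei for wi, ei in zip(w, e))
    return [ei - dot * wi for ei, wi in zip(e, w)]


def transh_score(h, r, t, w):
    """TransH score: the TransE score of the projected head and tail."""
    return transe_score(project_to_hyperplane(h, w), r, project_to_hyperplane(t, w))


def hinge_loss(pos_score, neg_score, margin):
    """Margin-based ranking loss for one positive/negative pair."""
    return max(0.0, margin + pos_score - neg_score)


h, r, t = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
corrupted_t = [0.0, 0.0]
pos = transe_score(h, r, t)
neg = transe_score(h, r, corrupted_t)
print(pos, neg, hinge_loss(pos, neg, margin=1.0))
```

In this toy case the positive triple satisfies the translation exactly, the corrupted triple is already farther away than the margin, and the pair contributes no loss.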

Although this category of methods achieves state-of-the-art results, the main limitation is the computationally expensive training process on large-scale knowledge graphs. Recently, TransE-AdaGrad (Minervini et al., 2015) was proposed to reduce the training time of TransE by employing AdaGrad (Duchi et al., 2011), a variant of SGD, to adaptively modify the learning rate. Although the training time is reduced greatly, there is still some way to go for large-scale knowledge graphs. With computation resources greatly enriched, training in parallel seems to be a more reliable way to relieve this issue. Indeed, some works, e.g., (Shao et al., 2015), parallelize graph computation paradigms such as online query processing and offline analytics. Nevertheless, it is not easy to train translating embedding methods in parallel, since the main optimization algorithm, SGD, is inherently sequential. The major obstacle to parallelizing SGD is the collisions between updates of different processors to the same parameter (Ruder, 2016), and there are two main branches of methods to overcome it.

The first branch designs a strategy to resolve collisions according to a specific data structure. For example, Hogwild! (Recht et al., 2011) is a lock-free scheme that works well for sparse data, where each iteration of SGD updates only a small part of the parameters. It has been proved that processors are unlikely to overwrite each other's progress, and the method can achieve a nearly optimal rate of convergence. The second branch splits the training data to reduce collisions. Downpour SGD (Dean et al., 2012) mainly employs the DistBelief (Dean et al., 2012) framework, which divides the training data into a number of subsets; the model replicas then run independently on each of these subsets and do not communicate with each other. Inspired by this, TensorFlow (Abadi et al., 2016) splits a computation graph into a subgraph for every worker, and communication takes place using Send/Receive node pairs. Motivated by training large-scale convolutional neural networks for image classification, Elastic Averaging SGD (EASGD) (Zhang et al., 2015) reduces the amount of communication between local workers and the master to allow the parameters of local workers to fluctuate further from the center ones. There are also works that improve performance in parallel settings, e.g., Delay-tolerant Algorithms for SGD (Mcmahan and Streeter, 2014), which, inspired by AdaGrad, adapt not only to the sequence of gradients but also to the precise update delays that occur.

However, these parallel frameworks are based on specific assumptions and cannot be directly applied to translating embedding models without exploring the distinctive data structures of knowledge graphs. Therefore, we propose a parallel framework for translating embedding models, called ParTrans-X. Since knowledge graphs are mainly in the form of triples and are trained triple by triple, they lead to a particular parallel framework.

3. Law of Collisions Emerging in KG

As mentioned previously, there may exist collisions between processors when they update the same embedding vector, which ends up being one of the most challenging aspects of parallelizing translating embedding methods. Hence, we explore the law of collisions emerging in this section. At first we formulate the training data of knowledge graphs into hypergraphs. Then the collisions in training process are further discussed based on this formulation.

3.1. Hypergraph Formulation

Firstly, we model the knowledge graph formally as $G = (E, R, T)$, where $E$ is the set of entities, $R$ the set of relations, and $T$ the set of triples $(h, r, t)$, in which $h, t \in E$ and $r \in R$. The cardinalities of $E$, $R$ and $T$ are $N_e$, $N_r$ and $N_t$ respectively. In this graph, nodes are entities, and edges are triples that connect nodes with a distinguished relation. For example, the knowledge graph shown in Figure 1(a), where black nodes stand for the entities in knowledge graphs and lines for relations, can be represented as , where , and . In this case, , and .

Figure 1. A Knowledge Graph (a) and one of the Hypergraphs generated by its training data (b).

Secondly, the training data of knowledge graphs can be regarded as hypergraphs. Recall the loss function of translating embedding methods in Eq.(1): in one iteration of SGD, only one positive triple and one negative triple are involved. To be clearer, the data used in one iteration, i.e., $\{(h,r,t), (h',r,t')\}$, is called a sample. Note that $(h',r,t')$ is constructed by substituting an entity $h'$ or $t'$ for $h$ or $t$ respectively, yielding a corrupted triple $(h',r,t)$ or $(h,r,t')$, which is simply denoted by $(h',r,t')$ following (Bordes et al., 2013). Consequently, a sample corresponds to three entities, i.e., $h$, $t$ and the corrupted entity $h'$ or $t'$, and one relation $r$. As a result, the training data can be formulated as a 4-uniform hypergraph, in which all hyperedges have the same cardinality 4. In this hypergraph, nodes are entities and relations, and edges are training samples containing 4 nodes, i.e., three entities and one relation. More formally,

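Definition 3.1.

The training data to embed the knowledge graph by translating embedding methods is organized as a 4-uniform hypergraph $H = (V, \mathcal{S})$, where $V = E \cup R$ is the set of entities or relations (the union of the entity set $E$ and the relation set $R$), and $\mathcal{S}$ is the set of training samples $s = \{h, r, t, e'\}$, where $h, t, e' \in E$ and $r \in R$, with $e'$ the corrupted entity.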

For example, the hypergraph in Figure 1(b) is one of the hypergraphs generated by Figure 1(a), where black nodes are entities, colored nodes are relations, and the colored blocks represent hyperedges. Here, different colors correspond to different relations. For instance, for triple , the negative triple sampled in Figure 1(b) is , which contributes to a sample , thus the hyperedge colored red contains and . Note that many other negative triples can be constructed, e.g., for triple , and the hypergraph generated in Figure 1(b) is just an example. Similarly, the other samples in Figure 1(b) are , and .
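The construction of hyperedges from triples can be sketched as follows. This is a minimal illustration under stated assumptions: the toy entity and relation names are invented for the example, the head or tail is corrupted with equal probability, and candidates that reproduce a known triple or repeat an entity of the positive triple are rejected.

```python
import random


def build_samples(triples, entities, seed=0):
    """Return one 4-node hyperedge (h, r, t, corrupted) per positive triple."""
    rng = random.Random(seed)
    known = set(triples)
    samples = []
    for h, r, t in triples:
        while True:
            e = rng.choice(entities)
            corrupt_head = rng.random() < 0.5
            negative = (e, r, t) if corrupt_head else (h, r, e)
            # Keep only corruptions that are not known facts and that
            # introduce a third, distinct entity into the hyperedge.
            if negative not in known and e not in (h, t):
                break
        samples.append((h, r, t, e))
    return samples


toy_entities = ["e1", "e2", "e3", "e4"]
toy_triples = [("e1", "r1", "e2"), ("e2", "r2", "e3"), ("e3", "r1", "e4")]
toy_samples = build_samples(toy_triples, toy_entities)
for sample in toy_samples:
    print(sample)
```

Each printed tuple is one hyperedge: three entities plus one relation, so every hyperedge has cardinality 4 regardless of which side was corrupted.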

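To better analyze the collisions between processors, we define the following statistics of the hypergraph $H = (V, \mathcal{S})$. Given a hyperedge $s \in \mathcal{S}$ with relation $r_s$ and entity set $E_s$,

$$\mathcal{N}_r(s) = \{ s' \in \mathcal{S} : r_{s'} = r_s \} \qquad (2)$$

denotes the set of hyperedges containing the same relation as hyperedge $s$.

$$\Delta_r = \max_{s \in \mathcal{S}} |\mathcal{N}_r(s)| \qquad (3)$$

denotes the maximal number of hyperedges containing the same relation, where $|\cdot|$ denotes the cardinality.

$$\mathcal{N}_e(s) = \{ s' \in \mathcal{S} : E_{s'} \cap E_s \neq \emptyset \} \qquad (4)$$

denotes the set of hyperedges containing one or more of the same entities as hyperedge $s$.

$$\Delta_e = \max_{s \in \mathcal{S}} |\mathcal{N}_e(s)| \qquad (5)$$

denotes the maximal number of hyperedges containing the same entities, where $|\cdot|$ denotes the cardinality as before.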
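The statistics in Eq.(2) to Eq.(5) can be computed directly from a list of samples. The sketch below assumes each sample is a tuple (head, relation, tail, corrupted entity) and that a hyperedge counts itself among the hyperedges sharing its relation or entities; both conventions are assumptions of this example.

```python
from collections import Counter, defaultdict


def max_shared_relation(samples):
    """Largest number of hyperedges that contain the same relation."""
    return max(Counter(r for _, r, _, _ in samples).values())


def max_shared_entity(samples):
    """Largest number of hyperedges sharing at least one entity with a hyperedge."""
    by_entity = defaultdict(set)            # entity -> indices of hyperedges using it
    for idx, (h, _, t, e) in enumerate(samples):
        for entity in (h, t, e):
            by_entity[entity].add(idx)
    best = 0
    for h, _, t, e in samples:
        neighbours = by_entity[h] | by_entity[t] | by_entity[e]
        best = max(best, len(neighbours))
    return best


example_samples = [
    ("e1", "r1", "e2", "e3"),
    ("e2", "r2", "e3", "e1"),
    ("e4", "r1", "e5", "e6"),
]
print(max_shared_relation(example_samples), max_shared_entity(example_samples))
```

The inverted index keeps the entity statistic close to linear in the number of samples for sparse data, instead of comparing every pair of hyperedges.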

3.2. Collision Formulation

In this section, we will verify that it is highly likely that few collisions happen when training by processors for large and sparse knowledge graphs. Let represent the event that processors select different samples. represents the event that there are collisions between relations, i.e., different processors update the same relation vector, and similarly for entities. The verification is decomposed into two steps: 1) to prove that it is quite likely that the processors handle different samples, i.e., , which is the prerequisite for no collisions; 2) to prove that it is unlikely that these different samples correspond to the same relations or entities, i.e., and .

Suppose that, for the embedding methods and the knowledge graph, the training samples of size are drawn independently and identically distributed (i.i.d.) from some unknown distribution . Therefore, the probability of being selected is supposed to be

(6)

Moreover, according to i.i.d., it is reasonable to assume that the sample selecting process by processors is an observation from a Multinomial Distribution, i.e., selecting one sample from samples and repeated times. Let denote the number of processors that select during the same iteration of SGD, then the possibility of being selected by processors, , being selected by processors is as follows,

(7)

where indicates that there are and only samples being selected in the same iteration of SGD.

Theorem 3.2.

For a knowledge graph with triples and training by processors in parallel, when is large and is relatively small, the possibility that processors select different samples is

(8)

with probability at least , where

(9)
Proof.

Provided that samples selected by processors are different, it can be easily derived that Then there are only sampling circumstances satisfying no collisions between samples, where distinct samples are selected once, and other samples are not selected, e.g., . Therefore, according to Eq.(3.2) and Eq.(6),

When is large and is relatively small, . ∎
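The first step of the argument can be checked numerically. The following is a minimal sketch, assuming that every processor picks one of the training samples uniformly at random and independently of the others (an assumption of this example); the function names are hypothetical.

```python
import random


def prob_all_distinct(num_samples, num_processors):
    """Exact probability that independent uniform picks are all different."""
    prob = 1.0
    for i in range(num_processors):
        prob *= (num_samples - i) / num_samples
    return prob


def simulate_all_distinct(num_samples, num_processors, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        picks = [rng.randrange(num_samples) for _ in range(num_processors)]
        hits += len(set(picks)) == num_processors
    return hits / trials


# With as many samples as FB15k has training triples and 20 processors,
# two processors almost never pick the same sample in one iteration.
print(prob_all_distinct(483_142, 20))
print(prob_all_distinct(100, 5), simulate_all_distinct(100, 5))
```

The exact product approaches 1 when the number of samples is much larger than the number of processors, which is the regime the theorem addresses.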

Theorem 3.3.

For a knowledge graph with triples and training in processors in parallel, when is relatively small and , we have the possibility of no relation in a collision is

(10)

with probability at least , where

(11)
Proof.

Given that processors select different samples, the probability of relations in a collision can be deduced according to conditional probability as follows,

(12)

where is the possibility of samples containing distinct relations being selected, which is supposed to be similar to sampling without replacement. More precisely, assuming a sample is selected randomly, then the next sample selected should be from , and the third sample should be selected from in . Accordingly, is deduced as follows when is satisfied,

By Eq.(12), the possibility of no collisions between relations in different processors is

(13)

Note that results in , which means is so large that one or more processors will definitely select the same relation among processors, namely, . Furthermore, when is relatively small, .

Similarly, the probability that no entities are in a collision can be derived as follows; the proof is omitted due to space limitations.

Theorem 3.4.

For a knowledge graph with triples and training in processors in parallel, when is relatively small and , the possibility of no collisions between entities is

(14)

with probability at least , where

(15)

It is verified in Theorem 3.2, Theorem 3.3 and Theorem 3.4 that if is large and and are relatively small, i.e., the knowledge graph is large and sparse, the number of processors can be very large with tolerable collisions, which enables the training process to run in parallel. Motivated by this, we define sparsity of training data in a knowledge graph by . The smaller its value is, the more processors can be used to parallelize the training process. Actually, it is the large and sparse knowledge graphs that are in dire need of parallel translating embedding methods, since they are far from complete but too large to train serially. Besides, since and are deduced from the worst case, it is reasonable to assume that the average and better reflect the general structure of knowledge graphs, and that collisions will be fewer in practice. As a result, we suppose that the method would still work well if the average and are relatively small, as a few collisions will not affect the consistency.

3.3. Special Insights on Parallelizing TransE

There is an interesting finding that TransE can be parallelized further than other translating embedding methods, since there are fewer collisions due to its distinctive score function . More precisely, the gradient calculation of TransE when using -similarity is as follows,

(16)

where represents the -th dimension of embedding vector , , and is the dimension of embedding space. It can be seen that in TransE, the gradient of each dimension is independent of other dimensions, which means that the collisions between different dimensions of the same embedding vector will not disturb each other. That is to say, only the collisions between the same dimension of the same embedding vector will matter in the training process of TransE.
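The per-dimension independence can be made concrete with a small example. This is a minimal sketch, assuming the L1 form of the TransE score (an assumption of this example), for which the gradient in dimension i is the sign of the residual in that dimension only.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def transe_l1_grad_dim(h, r, t, i):
    """Gradient of ||h + r - t||_1 with respect to (h_i, r_i, t_i).

    Only the i-th components of h, r and t are read.
    """
    g = sign(h[i] + r[i] - t[i])
    return g, g, -g


h, r, t = [0.5, 0.1], [0.2, 0.3], [0.1, 0.9]
grad_dim0 = transe_l1_grad_dim(h, r, t, 0)

# Simulate another processor overwriting dimension 1 of h in the meantime:
h[1] = 100.0
grad_dim0_after = transe_l1_grad_dim(h, r, t, 0)
print(grad_dim0, grad_dim0_after)
```

Because the gradient for dimension 0 is unchanged by a concurrent write to dimension 1, only simultaneous updates to the same dimension of the same vector can interfere.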

Figure 2. Updating embedding vector in parallel

For example, Figure 2 shows the updating of by two processors (Processor1 and Processor2) at the same time, where is the gradient of calculated by Processor1, and by Processor2. Normally, when Processor2 calculates the gradient , the whole embedding vector is involved, which has been half updated by Processor1. Obviously, this results in training errors. In contrast, if Figure 2 depicts the training process of TransE, the calculation of by Processor2 only concerns the -th dimension . As a result, there will be no disturbance between Processor1 and Processor2, as long as the two processors are not updating the same dimension of the same embedding vector.

Consequently, the probability of collisions is greatly decreased for TransE, since a collision requires not only that the entities or relations are the same, but also that the dimensions being updated are the same. Namely, the maximal degree of parallelism is far larger than for other translating embedding methods. This indicates that parallelizing without locks is ideally suited to TransE, and may scale well to extremely large knowledge graphs given sufficient computation resources.

4. The ParTrans-X Framework

Inspired by the findings that collisions between processors are negligible when a knowledge graph is large and sparse, a parallel framework for these methods is designed, called ParTrans-X, and we will describe it in detail in this section.

4.1. Framework Description

The pseudocode for the implementation of ParTrans-X is shown in Algorithm 1. As the embedding vectors are updated frequently, they are stored in shared memory and every processor can perform updates to them freely.

The training process of ParTrans-X starts with initializing the embedding vectors according to a Uniform or Bernoulli distribution, where no parallel section is needed since it takes constant time. However, we can parallelize the learning process of each epoch, which is the most time-consuming part. Running by processors in parallel can decrease the training epochs by times, i.e., the parallel training epoch is . To do this, we first determine the random sampling seed by calling for the -th processor. The random sampling seeds differ from each other to avoid the same pseudo-random sequence for different processors. Then, each processor performs the embedding learning procedure epoch by epoch asynchronously (lines 5-12). One epoch is a loop over all triples. Each loop first normalizes the entity embedding vectors following (Bordes et al., 2013). Then a positive triple is sampled from shared memory, where means that the current processor is the -th processor, and superscript stands for the -th epoch. According to , a negative triple is generated by sampling a corrupted entity (or ) from shared memory, where and are the same as before. That is to say, a sample is constructed from and , which is then used to calculate the gradient according to Eq.(1) and to update the embeddings of entities and relations .

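Input: Training triples, entity and relation sets, embedding dimension, margin, training epochs, and the number of processors;
Output: Embeddings of entities and relations;
1:  Initialize the embeddings of entities and relations by uniform distribution and persist them in the shared memory
2:  for each processor do
3:     In Parallel
4:     set the random sampling seed of the current processor
5:     for each of the parallel training epochs do
6:        loop
7:           normalize the embedding vector for each entity
8:           sample a positive triple from shared memory
9:           generate a negative triple by sampling a corrupted entity from shared memory
10:          construct the sample from the positive and negative triples
11:          calculate the gradient of the loss in Eq.(1) with the score function of the method
12:          update the embeddings of the entities and the relation in the sample
13:       end loop
14:    end for
15: end for
16: Generate embeddings of entities and relations after all processors finish
Algorithm 1 ParTrans-X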
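The control flow of Algorithm 1 can be sketched with threads that share the embedding tables and update them without locks. This is only an illustration under stated assumptions, not the authors' implementation (which is in C++ with OpenMP): it uses the L1 TransE score, normalizes only the entities touched in an iteration, does not filter corrupted triples that happen to be true, and all names and default values are hypothetical. In standard CPython the global interpreter lock prevents a real speedup, so the code shows the structure rather than the performance.

```python
import math
import random
import threading


def sign(x):
    return (x > 0) - (x < 0)


def train_lock_free(triples, entities, relations, dim=8, margin=1.0,
                    lr=0.01, epochs=8, num_workers=2, seed=0):
    init = random.Random(seed)
    bound = 6 / math.sqrt(dim)
    # Shared memory: plain lists that every worker mutates without locks.
    ent = {e: [init.uniform(-bound, bound) for _ in range(dim)] for e in entities}
    rel = {r: [init.uniform(-bound, bound) for _ in range(dim)] for r in relations}

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        for i in range(dim):
            vec[i] /= norm

    def score(h, r, t):  # L1 TransE score
        return sum(abs(a + b - c) for a, b, c in zip(ent[h], rel[r], ent[t]))

    def worker(p):
        rng = random.Random(seed + 1 + p)             # a different seed per worker
        for _ in range(max(1, epochs // num_workers)):
            for _ in range(len(triples)):             # one pass over all triples
                h, r, t = rng.choice(triples)         # positive triple
                e = rng.choice(entities)              # corrupted entity
                hn, tn = (e, t) if rng.random() < 0.5 else (h, e)
                for x in {h, t, e}:
                    normalize(ent[x])
                if margin + score(h, r, t) - score(hn, r, tn) <= 0:
                    continue                          # hinge inactive, no update
                for i in range(dim):
                    gp = sign(ent[h][i] + rel[r][i] - ent[t][i])
                    gn = sign(ent[hn][i] + rel[r][i] - ent[tn][i])
                    ent[h][i] -= lr * gp
                    ent[t][i] += lr * gp
                    rel[r][i] -= lr * (gp - gn)
                    ent[hn][i] += lr * gn
                    ent[tn][i] -= lr * gn

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return ent, rel


demo_entities = ["a", "b", "c", "d"]
demo_relations = ["r1", "r2"]
demo_triples = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r1", "d")]
demo_ent, demo_rel = train_lock_free(demo_triples, demo_entities, demo_relations,
                                     dim=4, epochs=4, num_workers=2)
print(len(demo_ent), len(demo_rel))
```

Each worker runs only a share of the total epochs, mirroring the division of training epochs among processors, and the per-worker seed avoids identical sampling sequences.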

4.2. Application to Typical translating embedding Methods

The framework can be applied to many translating embedding methods that employ SGD or its variants to optimize the hinge loss with a similar algorithmic framework and differ only in the score function, as mentioned in Sec.2, e.g., TransE and TransH. Hence, their parallel algorithms can be obtained by applying the corresponding score function in Lines 11-12 of the pseudocode in Algorithm 1.

For example, for TransE, the gradient updating procedure in Line 11 is performed according to Eq.(16). For TransH, which employs the score function , the gradient updating procedure of in Line 11 is as follows,

(17)

Namely, ParTrans-X has the flexibility to parallelize many translating embedding methods, since they share a similar training process.

Moreover, ParTrans-X can be directly applied to the improved variant TransE-AdaGrad, since the training data sparsity of the knowledge graph still holds. In one iteration, AdaGrad updates the embedding vectors with a learning rate that depends on the gradients from the previous iterations. Being highly similar to SGD, AdaGrad can be easily parallelized using our framework by only performing a learning rate calculation during the gradient update procedure, i.e., Line 12 of the pseudocode in Algorithm 1. For example, to parallelize TransE-AdaGrad, the learning rate is determined adaptively by adding

(18)

before Line 12 in Algorithm 1, where is the current epoch, with the learning rate of -th epoch. represents all the previous gradient before -th epoch. is the initial learning rate.
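The adaptive learning rate can be illustrated with the standard AdaGrad rule of Duchi et al. (2011), in which the initial learning rate is divided by the square root of the accumulated squared gradients. Whether Eq.(18) takes exactly this per-dimension form is an assumption of this sketch, and the class name and the small constant `eps` are hypothetical.

```python
import math


class AdaGradVector:
    """One embedding vector updated with per-dimension AdaGrad steps."""

    def __init__(self, values, initial_lr=0.1, eps=1e-8):
        self.values = list(values)
        self.initial_lr = initial_lr
        self.eps = eps                          # avoids division by zero
        self.sum_sq = [0.0] * len(self.values)  # accumulated squared gradients

    def step(self, grad):
        for i, g in enumerate(grad):
            self.sum_sq[i] += g * g
            lr = self.initial_lr / (math.sqrt(self.sum_sq[i]) + self.eps)
            self.values[i] -= lr * g


vec = AdaGradVector([0.0, 0.0], initial_lr=0.1)
vec.step([1.0, 0.0])
after_first = list(vec.values)
vec.step([1.0, 0.0])
after_second = list(vec.values)
print(after_first, after_second)
```

The second step is smaller than the first because the accumulated gradient has grown, which is how the effective learning rate decays for frequently updated dimensions.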

5. Experiment

Firstly, we apply ParTrans-X to TransE, TransH and TransE-AdaGrad in Sec.5.1. In Sec.5.2, experimental results demonstrate a substantial reduction in training time by ParTrans-X, and the scaling performance with an increasing number of processors is shown in Sec.5.3.

5.1. Experimental Settings

The datasets employed are two representative datasets, WN18 and FB15k, which are subsets of the well-known knowledge graphs WordNet and Freebase respectively, and have been widely used by translating embedding methods (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015b; Jia et al., 2016). Table 2 shows their statistics. Without loss of generality, and are also shown, and they are both small on WN18 and FB15k. Furthermore, it can be seen that the two datasets possess different characteristics. Namely, WN18 possesses only 18 relations, which results in a large probability of collisions between relations. In contrast, FB15k is less unbalanced in the number of entities and relations.

| Data | # Rel | # Ent | # Train | # Valid | # Test | | |
|---|---|---|---|---|---|---|---|
| WN18 | 18 | 40,943 | 141,442 | 5,000 | 5,000 | 1.7e-1 | 5.6e-4 |
| FB15k | 1,345 | 14,951 | 483,142 | 50,000 | 59,071 | 9.3e-3 | 2.3e-3 |

Table 2. Two widely used datasets in KGs.
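WN18

| Method | Mean Rank (Raw) | Mean Rank (Filter) | Hits@10 (Raw) | Hits@10 (Filter) | Training Time (s) | Speedup Ratio |
|---|---|---|---|---|---|---|
| TransE | 214 | 203 | 58.2 | 65.9 | 473 | - |
| ParTransE | 217 | 206 | 55.7 | 63.1 | 54 | 9 |
| TransE-AdaGrad | 209 | 197 | 68.9 | 77.7 | 100 | 4.7 |
| ParTransE-AdaGrad | 219 | 208 | 67.7 | 76.2 | 17 | 28 (4.7) |
| TransH | 227 | 216 | 66.5 | 75.9 | 637 | - |
| ParTransH | 215 | 203 | 66.8 | 76.6 | 134 | 4.8 |

FB15k

| Method | Mean Rank (Raw) | Mean Rank (Filter) | Hits@10 (Raw) | Hits@10 (Filter) | Training Time (s) | Speedup Ratio |
|---|---|---|---|---|---|---|
| TransE | 184 | 73 | 44.5 | 60.7 | 4658 | - |
| ParTransE | 185 | 74 | 44.4 | 60.5 | 364 | 13 |
| TransE-AdaGrad | 185 | 69 | 45.3 | 62.3 | 496 | 9 |
| ParTransE-AdaGrad | 186 | 70 | 44.9 | 61.9 | 42 | 111 (9) |
| TransH | 183 | 60 | 46.6 | 65.5 | 6066 | - |
| ParTransH | 183 | 60 | 46.8 | 65.7 | 474 | 13 |

Table 3. Link prediction performance with all time measured in wall-clock seconds.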

To tackle the problem, experiments are conducted on the link prediction task, which aims to predict the missing entity $h$ or $t$ for a triple $(h, r, t)$. Namely, it predicts $t$ given $(h, r)$ or predicts $h$ given $(r, t)$. Similar to the setting in (Bordes et al., 2013), the task returns a list of candidate entities from the knowledge graph.

To evaluate the performance of link prediction, we adopt Mean Rank and Hits@10 under the “Raw” and “Filter” settings as evaluation measures, following (Bordes et al., 2013). Mean Rank is the average rank of the correct entities, and Hits@10 is the proportion of correct entities ranked in the top 10. A good predictor has a low Mean Rank and a high Hits@10. Ranking against all corrupted triples is called the “Raw” setting, while the “Filter” setting filters out the corrupted triples that are in fact correct.
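The two measures and the difference between the “Raw” and “Filter” settings can be sketched as follows. This is a minimal illustration, assuming each candidate entity has a dissimilarity score where lower is better; the function names and the toy scores are invented for the example.

```python
def rank_of(correct, scores, filtered_out=()):
    """1-based rank of `correct` among candidates sorted by ascending score.

    Entities in `filtered_out` (other known correct answers) are skipped,
    which corresponds to the "Filter" setting; leave it empty for "Raw".
    """
    target = scores[correct]
    skip = set(filtered_out)
    better = sum(1 for entity, s in scores.items()
                 if entity != correct and entity not in skip and s < target)
    return 1 + better


def mean_rank(ranks):
    return sum(ranks) / len(ranks)


def hits_at_k(ranks, k=10):
    """Percentage of test cases whose correct entity is ranked within the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


toy_scores = {"a": 0.1, "b": 0.2, "c": 0.3}
raw_rank = rank_of("c", toy_scores)
filtered_rank = rank_of("c", toy_scores, filtered_out=["a"])
toy_ranks = [1, 3, 12, 20]
print(raw_rank, filtered_rank, mean_rank(toy_ranks), hits_at_k(toy_ranks))
```

Filtering can only improve (lower) a rank, which is why the “Filter” columns in Table 3 show lower Mean Rank and higher Hits@10 than the “Raw” columns.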

To evaluate the speedup, we adopt Training Time and Speedup Ratio as evaluation measures, where Training Time is measured in wall-clock seconds. Speedup Ratio is

$$\text{Speedup Ratio} = \frac{T_s}{T_p} \qquad (19)$$

where $T_s$ is the training time in serial, and $T_p$ is the training time under parallel methods.
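As a quick check, the speedup ratios reported in Table 3 can be recomputed from the training times in the same table. The helper below is a minimal sketch; the function name is hypothetical.

```python
def speedup_ratio(serial_seconds, parallel_seconds):
    """Serial training time divided by parallel training time."""
    return serial_seconds / parallel_seconds


# Training times in wall-clock seconds, taken from Table 3.
wn18_partranse = speedup_ratio(473, 54)           # TransE vs ParTransE on WN18
fb15k_partranse = speedup_ratio(4658, 364)        # TransE vs ParTransE on FB15k
wn18_par_adagrad = speedup_ratio(473, 17)         # TransE vs ParTransE-AdaGrad on WN18
fb15k_par_adagrad = speedup_ratio(4658, 42)       # TransE vs ParTransE-AdaGrad on FB15k
wn18_partransh = speedup_ratio(637, 134)          # TransH vs ParTransH on WN18
fb15k_partransh = speedup_ratio(6066, 474)        # TransH vs ParTransH on FB15k

print(round(wn18_partranse), round(fb15k_partranse))
print(round(wn18_par_adagrad), round(fb15k_par_adagrad))
print(round(wn18_partransh, 1), round(fb15k_partransh))
```

The ratios for the AdaGrad variants are taken relative to serial TransE, which is why they exceed the number of processors: they combine the gain from fewer epochs with the gain from parallel execution.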

Baselines include typical translating embedding methods, TransE, TransH and TransE-AdaGrad, which can all be trained in parallel using the ParTrans-X framework, denoted by ParTransE, ParTransH and ParTransE-AdaGrad respectively in Table 3. Note that TransE and TransH use the publicly available programs (https://github.com/thunlp/KB2E), which are the most efficient serial versions to our knowledge, and TransE-AdaGrad is implemented based on TransE.

Each experiment is conducted 10 times and the average is reported, with all times measured in wall-clock seconds. Our experiments are carried out on a machine with dual Intel Xeon E5-2640 CPUs, each of which has 10 physical cores (20 logical cores) running at 2.4 GHz. The machine has 128 GB RAM and runs Red Hat 4.4.7. The language used is C++ and the program is compiled with gcc version 6.3.0. We use OpenMP for multithreading, and each thread is bound to a processor.

5.2. Link Prediction Performance of ParTrans-X

Experiments on each baseline and its parallel implementation in ParTrans-X employ the same hyper-parameters, which are chosen on the validation set. The learning rate during the stochastic gradient descent process is selected among {0.1,0.01,0.001}, the embedding dimensions and are selected in {20,50,100}, and the margin between positive and negative triples is selected among {1,2,3,4}. For TransE and ParTransE, the parameters are on WN18, and on FB15k. For TransH and ParTransH, the parameters are on WN18, and on FB15k. For TransE-AdaGrad and ParTransE-AdaGrad, the parameters are on WN18, and on FB15k. All the experiments employ -similarity. ParTransE, ParTransH and ParTransE-AdaGrad all run in processors for both datasets.

It can be observed from Table 3 that:

1. Link prediction performance in parallel is as good as the serial counterparts on both WN18 and FB15k, which demonstrates that ParTrans-X will not affect embedding performance.

2. The training time is greatly reduced by ParTrans-X. On WN18, TransE-AdaGrad only speeds up TransE by 4.7 times, compared to our 28 times. On FB15k, the training time of TransE is reduced from more than 1 hour to less than 1 minute by ParTransE-AdaGrad.

3. ParTrans-X achieves a higher speedup ratio on FB15k than on WN18. Since FB15k has far more training triples than WN18, each epoch takes much longer on FB15k. As a result, the overhead of multi-threading is a smaller fraction of the whole training time on FB15k, which leads to a higher speedup ratio. This further validates the advantage of ParTrans-X in handling large datasets.

4. ParTrans-X achieves an enormous improvement in training time when applied to TransE-AdaGrad, especially on FB15k, where the speedup ratio improves from 9 to 111. Since AdaGrad decreases the total number of epochs needed by making convergence come earlier, and ParTrans-X reduces training time by running in parallel, the two strategies achieve a higher speedup ratio when combined.
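To make the AdaGrad part of the last observation concrete, the following sketch shows the standard AdaGrad update rule (Duchi et al., 2011) applied to one embedding vector. It is a generic illustration, not necessarily the exact variant used in TransE-AdaGrad; the function name `adagrad_step` and the default values of `learning_rate` and `eps` are assumptions.

```python
import math


def adagrad_step(params, grads, accum, learning_rate=0.1, eps=1e-8):
    """Apply one in-place AdaGrad update.

    params: list of floats (e.g. one embedding vector)
    grads:  gradient of the loss with respect to params
    accum:  running sum of squared gradients, same length as params
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        # Each coordinate gets its own effective step size, which shrinks
        # as squared gradients accumulate for that coordinate.
        params[i] -= learning_rate * g / (math.sqrt(accum[i]) + eps)
```

Because the per-coordinate step size adapts to the gradient history, rarely updated embeddings keep relatively large steps, which is one common explanation for the earlier convergence mentioned above.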

Figure 3. The descent process of loss (panels: WN18, FB15k).

Moreover, the descent process of loss for the three algorithms on WN18 and FB15k is shown in Figure 3. It can be seen that, for both datasets, the loss optimized by ParTrans-X falls sharply in the first epochs, and it yields noticeably lower loss values than TransE-AdaGrad and TransE even after a few iterations (5 epochs). Again, ParTrans-X performs better on FB15k than on WN18, which shows that it is more effective on large datasets.

5.3. Scaling Results for Multi-Processors

Furthermore, we carry out a number of experiments to test if the implementations scale with increasing number of processors. We mainly analyze two aspects of experiment results, i.e., the training time and the link prediction performance.

Figure 4. Log-log plot of training time along with number of processors (panels: WN18, FB15k).

Figure 4 shows the log-log plot of the training time in wall-clock seconds for different numbers of processors. We can observe that the training time continues to decrease as the number of processors increases on both WN18 and FB15k. The absolute training time of ParTransE-AdaGrad is better than that of ParTransE, which in turn is better than that of ParTransH, consistent with the previous result. Moreover, the total training time of ParTransE-AdaGrad drops sharply when the number of processors is less than four and declines more slowly afterwards. This is because the training time of ParTransE-AdaGrad with a few processors is already fairly short, so the growing communication cost with more processors has a larger effect on its total training time than on that of the other methods.

Figure 5. Hits@10 performance along with number of processors (panels: WN18, FB15k).

The predictive performance measured by Hits@10 with an increasing number of processors is shown in Figure 5. It can be seen that ParTransE, ParTransE-AdaGrad and ParTransH always maintain good performance, which validates the applicability and superiority of ParTrans-X. Note that the performance on FB15k is more stable than on WN18: since there are more training triples in FB15k, the model is trained more sufficiently and its predictive performance is more stable, which again validates the superiority of ParTrans-X on large datasets.
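For reference, Hits@10 is the fraction of test triples whose correct entity is ranked within the top 10 candidates. A minimal sketch is given below; it assumes 1-based ranks have already been computed by the scoring model, and the function name `hits_at_k` is hypothetical.

```python
def hits_at_k(ranks, k=10):
    """Return the fraction of 1-based ranks that are less than or equal to k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

For instance, with ranks 1, 5, 10, 11 and 200, three of the five correct entities fall within the top 10, so Hits@10 is 0.6.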

6. Conclusion

In this paper, we explore the law of collisions emerging in knowledge graphs by modelling training data as hypergraphs. Our key observation is that one learning iteration concerns only a few embeddings, which are not necessarily related to those of other iterations, so the probability of collisions between different processors is negligible. Based on this assumption, we propose an efficient parallel framework for translating embedding methods, called ParTrans-X. It exploits the intrinsic sparsity of training data in large knowledge graphs, which enables the embedding vectors to be learnt without locks and without inducing errors. Experiments validate that ParTrans-X can speed up the training process by more than an order of magnitude, without degrading embedding performance. The source code of this paper is available at https://github.com/zdh2292390/ParTrans-X.
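The sparsity argument can be checked empirically on any triple set. The sketch below is one plausible way to do so and is not the paper's formal analysis: it assumes that two triples collide when they share an entity or a relation, and the helper names `embeddings_touched` and `collision_fraction` are hypothetical.

```python
from itertools import combinations


def embeddings_touched(triple):
    """Return the set of embeddings that one update on (head, relation, tail) touches."""
    head, relation, tail = triple
    return {("entity", head), ("entity", tail), ("relation", relation)}


def collision_fraction(triples):
    """Return the fraction of triple pairs that touch at least one common embedding.

    Each pair stands in for two triples processed at the same time by
    different processors (an assumption made for illustration).
    """
    pairs = list(combinations(triples, 2))
    if not pairs:
        return 0.0
    colliding = sum(
        1 for a, b in pairs if embeddings_touched(a) & embeddings_touched(b)
    )
    return colliding / len(pairs)
```

Applied to small random batches drawn from a training set, a low value of this fraction would be consistent with the observation that simultaneous updates rarely touch the same embedding.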

7. Acknowledgments

We thank Jun Xu and the anonymous reviewers for valuable suggestions. The work was funded by National Natural Science Foundation of China (No. 61572469, 61402442, 91646120, 61572473, 61402022), the National Key R&D Program of China (No. 2016QY02D0405, 2016YFB1000902), and National Grand Fundamental Research 973 Program of China (No. 2013CB329602, 2014CB340401).

References

  • Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, and Matthieu Devin. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. (2016).
  • Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 1247–1250.
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems. 2787–2795.
  • Dean et al. (2012) Jeffrey Dean, Greg S Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V Le, Mark Z Mao, Marc’Aurelio Ranzato, Andrew Senior, and Paul Tucker. 2012. Large scale distributed deep networks. In International Conference on Neural Information Processing Systems. 1223–1231.
  • Dong et al. (2014) Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. ACM. 601–610 pages.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121–2159.
  • Jia et al. (2014) Yantao Jia, Yuanzhuo Wang, Xueqi Cheng, Xiaolong Jin, and Jiafeng Guo. 2014. OpenKN: An open knowledge computational engine for network big data. In Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 657–664.
  • Jia et al. (2016) Yantao Jia, Yuanzhuo Wang, Hailun Lin, Xiaolong Jin, and Xueqi Cheng. 2016. Locally Adaptive Translation for Knowledge Graph Embedding. In AAAI.
  • Langford et al. (2009) John Langford, Alexander J Smola, and Martin Zinkevich. 2009. Slow learners are fast. In International Conference on Neural Information Processing Systems. 2331–2339.
  • Lin et al. (2015a) Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. 2015a. Modeling Relation Paths for Representation Learning of Knowledge Bases. Computer Science (2015).
  • Lin et al. (2015b) Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015b. Learning entity and relation embeddings for knowledge graph completion. In Twenty-Ninth AAAI Conference on Artificial Intelligence. 2181–2187.
  • Mahdisoltani et al. (2014) Farzaneh Mahdisoltani, Joanna Biega, and Fabian Suchanek. 2014. YAGO3: A Knowledge Base from Multilingual Wikipedias. (2014).
  • Mcmahan and Streeter (2014) H. B. Mcmahan and M. Streeter. 2014. Delay-tolerant algorithms for asynchronous distributed online learning. Advances in Neural Information Processing Systems 4 (2014), 2915–2923.
  • Miller (1995) George A Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
  • Minervini et al. (2015) Pasquale Minervini, Claudia d’Amato, Nicola Fanizzi, and Floriana Esposito. 2015. Efficient Learning of Entity and Predicate Embeddings for Link Prediction in Knowledge Graphs.. In URSW@ ISWC. 26–37.
  • Recht et al. (2011) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems. 693–701.
  • Ruder (2016) Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. (2016).
  • Shao et al. (2015) Bin Shao, Yatao Li, and Haixun Wang. 2015. Parallel Processing of Graphs. (2015).
  • Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge Graph Embedding by Translating on Hyperplanes. AAAI - Association for the Advancement of Artificial Intelligence (2014).
  • Zhang et al. (2015) Sixin Zhang, Anna Choromanska, and Yann Lecun. 2015. Deep learning with Elastic Averaging SGD. Computer Science (2015), 685–693.