Knowledge graphs (KGs) are data structures that store information about different entities (nodes) and their relations (edges). They are used to organize information in many domains such as music, movies, (e-)commerce, and sciences. A common approach to using KGs in various information retrieval and machine learning tasks is to compute knowledge graph embeddings (KGE) (Wang et al., 2017; Goyal and Ferrara, 2018). These approaches embed a KG’s entities and relation types into a $d$-dimensional space such that the embedding vectors associated with the entities and the relation types associated with each edge satisfy a pre-determined mathematical model. Numerous models for computing knowledge graph embeddings have been developed, such as TransE (Bordes et al., 2013), TransR (Lin et al., 2015) and DistMult (Yang et al., 2015).
As the size of KGs has grown, so has the time required to compute their embeddings. As a result, a number of approaches and software packages have been developed that exploit concurrency in order to accelerate the computations. Among them are GraphVite (Zhu et al., 2019), which parallelizes the computations using multi-GPU training, and Pytorch-BigGraph (PBG) (Lerer et al., 2019), which uses distributed training to split the computations across a cluster of machines. However, these approaches suffer from high data-transfer overheads and low computational efficiency. As a result, the time required to compute embeddings for large KGs is high.
In this paper we present various optimizations that accelerate KGE training on knowledge graphs with millions of nodes and billions of edges using multi-processing, multi-GPU, and distributed parallelism. These optimizations are designed to increase data locality, reduce communication overhead, overlap computations with memory accesses, and achieve high operation efficiency.
We introduce novel approaches to decomposing the computations across different computing units (cores, GPUs, machines) that enable massive parallelization while reducing write conflicts and communication overhead. The write conflicts are reduced by partitioning the processing associated with different relation types across the computing units, which also reduces data communication in multi-GPU training. The communication overhead is reduced by using a min-cut-based graph partitioning algorithm (METIS (Karypis and Kumar, 1998)) to distribute the knowledge graph across the machines. For entity embeddings, we introduce massive asynchronicity by having separate processes compute the gradients of embeddings independently and by overlapping entity embedding updates with mini-batch computation. Finally, we use various negative sampling strategies to construct mini-batches that involve only a small number of embeddings, which reduces data movement from memory to computing units (e.g., CPUs and GPUs).
We implement an open-source KGE package called DGL-KE that incorporates all of the optimization strategies to train KG embeddings on large KGs efficiently. The package is implemented in Python on top of Deep Graph Library (DGL) (Wang et al., 2019), along with a C++-based distributed key-value store specifically designed for DGL-KE. We rely on DGL to perform graph-related computation, such as sampling, and rely on existing deep learning frameworks, such as Pytorch (Paszke et al., 2017) and MXNet (Chen et al., 2015), to perform tensor computation. DGL-KE is available at https://github.com/awslabs/dgl-ke.
We experimentally evaluate the performance of DGL-KE on different knowledge graphs and compare its performance against GraphVite and Pytorch-BigGraph. Our experiments show that DGL-KE is able to compute embeddings whose quality is comparable to that of competing approaches at a fraction of their time. In particular, on a knowledge graph containing over 86M nodes and 338M edges, DGL-KE can compute the embeddings in 100 minutes on an EC2 instance with 8 GPUs and in 30 minutes on an EC2 cluster of 4 machines with 48 cores per machine. These results represent speedups over the time required by GraphVite (multi-GPU training) and Pytorch-BigGraph (distributed training), respectively.
Definitions & Notation
A graph $G = (V, E)$ is composed of vertices and edges, where $V$ is the set of vertices and $E$ is the set of edges. A knowledge graph (KG) is a special type of graph whose vertices and edges have types. It is a flexible data structure that represents entities and their relations in a dataset. A vertex in a knowledge graph represents an entity and an edge represents a relation between two entities. The edges are usually in the form of triplets $(h, r, t)$, each of which indicates that a pair of entities $h$ (head) and $t$ (tail) are coupled via a relation $r$.
Knowledge graph embeddings are low-dimensional representations of entities and relations. These embeddings carry the information of the entities and relations in the knowledge graph and are widely used in tasks such as knowledge graph completion and recommendation. Throughout the paper, we denote the embedding vectors of the head entity, tail entity and relation by $\mathbf{h}$, $\mathbf{t}$ and $\mathbf{r}$, respectively; all the embeddings have the same dimension $d$.
Knowledge Graph Embedding (KGE) Models
KGE models train entity embeddings and relation embeddings in a knowledge graph. They define a score function $f(h, r, t)$ on the triplets and optimize the embeddings to maximize the scores of triplets that exist in the knowledge graph and minimize the scores of triplets that do not exist.
Many score functions have been defined to train knowledge graph embeddings (Wang et al., 2017), and Table 1 lists the ones used by the KGE models supported by DGL-KE. TransE and TransR are two representative translational distance models, where we use the L1 or L2 norm to define the distance. DistMult, ComplEx, and RESCAL are semantic matching models that exploit similarity-based scoring functions. Some of the models are much more computationally expensive than others. For example, TransR is about $d$ times more computationally expensive than TransE because TransR has additional matrix multiplications on both head and tail entity embeddings, instead of just the element-wise operations on embeddings in TransE.
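|Model||Score function|
|TransE (Bordes et al., 2013)||$-\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert_{1/2}$|
|TransR (Lin et al., 2015)||$-\lVert M_r \mathbf{h} + \mathbf{r} - M_r \mathbf{t} \rVert_2^2$|
|DistMult (Yang et al., 2015)||$\mathbf{h}^\top \mathrm{diag}(\mathbf{r})\, \mathbf{t}$|
|ComplEx (Trouillon et al., 2016)||$\mathrm{Re}(\mathbf{h}^\top \mathrm{diag}(\mathbf{r})\, \bar{\mathbf{t}})$|
|RESCAL (Nickel et al., 2011)||$\mathbf{h}^\top M_r \mathbf{t}$|
|RotatE (Sun et al., 2019)||$-\lVert \mathbf{h} \circ \mathbf{r} - \mathbf{t} \rVert$|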
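As a rough illustration of how such score functions behave, the following NumPy sketch implements three of the models in Table 1 using their standard textbook forms. The function names and the toy vectors are assumptions made for this example and are not part of DGL-KE.

```python
import numpy as np

def transe_score(h, r, t, p=2):
    """TransE: negative L1 (p=1) or L2 (p=2) distance between h + r and t."""
    return -np.linalg.norm(h + r - t, ord=p)

def distmult_score(h, r, t):
    """DistMult: tri-linear product h^T diag(r) t."""
    return float(np.sum(h * r * t))

def transr_score(h, r, t, M_r):
    """TransR: project both entities with the relation matrix M_r, then translate."""
    return -np.linalg.norm(M_r @ h + r - M_r @ t, ord=2) ** 2

rng = np.random.default_rng(0)
d = 4
h, r = rng.normal(size=d), rng.normal(size=d)
t_true = h + r                # a tail that TransE fits perfectly
t_rand = rng.normal(size=d)   # an unrelated tail

# The well-fitting triplet receives the higher (less negative) score.
print(transe_score(h, r, t_true), transe_score(h, r, t_rand))
```

Note that the TransR variant performs two $d \times d$ matrix-vector products per triplet, which is where its extra cost relative to TransE comes from.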
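To train a KGE model, we define a loss function on a set of positive and negative samples from the knowledge graph. Two loss functions are commonly used. The first is the logistic loss given by
$$\min \sum_{(h, r, t) \in \mathbb{D}^+ \cup \mathbb{D}^-} \log\left(1 + \exp\left(-y \times f(h, r, t)\right)\right),$$
where $f(h, r, t)$ is the score of a triplet, $\mathbb{D}^+$ and $\mathbb{D}^-$ are the positive and negative sets of triplets, respectively, and $y$ is the label of a triplet, $+1$ for positive and $-1$ for negative. The second is the pairwise ranking loss given by
$$\min \sum_{(h, r, t) \in \mathbb{D}^+} \sum_{(h', r', t') \in \mathbb{D}^-} \max\left(0, \gamma - f(h, r, t) + f(h', r', t')\right),$$
where $\gamma$ is the margin.
A common strategy for generating negative samples is to corrupt a triplet by replacing its head entity or tail entity with entities sampled from the graph according to some heuristic, which forms a negative sample $(h', r, t)$ or $(h, r, t')$, where $h'$ and $t'$ denote the randomly sampled entities. In principle, we could also corrupt the relation in a triplet. In this work, we only corrupt entities to generate negative samples.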
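The two losses and the corruption step can be sketched in a few lines of plain Python. This is a minimal sketch: the function names, the default margin of 1.0, and the uniform choice of the replacement entity are assumptions made for illustration.

```python
import math
import random

def logistic_loss(pos_scores, neg_scores):
    """Sum of log(1 + exp(-y * f)) with y = +1 for positives and -1 for negatives."""
    loss = sum(math.log1p(math.exp(-s)) for s in pos_scores)
    loss += sum(math.log1p(math.exp(s)) for s in neg_scores)
    return loss

def pairwise_ranking_loss(pos_scores, neg_scores, gamma=1.0):
    """Sum of max(0, gamma - f(pos) + f(neg)) over all positive/negative pairs."""
    return sum(max(0.0, gamma - p + n) for p in pos_scores for n in neg_scores)

def corrupt(triplet, num_entities, rng, corrupt_head):
    """Replace the head or the tail with a uniformly sampled entity id."""
    h, r, t = triplet
    e = rng.randrange(num_entities)
    return (e, r, t) if corrupt_head else (h, r, e)

rng = random.Random(0)
negative = corrupt((3, 1, 7), num_entities=100, rng=rng, corrupt_head=True)
```

A positive triplet that already outscores a negative one by at least the margin contributes nothing to the ranking loss, whereas the logistic loss always contributes a small positive term.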
Mini-batch training and Asynchronous updates
A KGE model is typically trained in a mini-batch fashion. We first sample a mini-batch of triplets that exist in the knowledge graph. Mini-batch training is sparse because a batch involves only a small number of entity embeddings and relation embeddings. We can take advantage of the sparsity and train KGE models asynchronously with sparse gradient updates (Recht et al., 2011). That is, we sample multiple mini-batches independently, perform asynchronous stochastic gradient descent (SGD) on these mini-batches in parallel and only update the embeddings involved in the mini-batches. This training strategy maximizes parallelization in mini-batch training but may lead to conflicts when updating embeddings. When two mini-batches run simultaneously, they may use the same entity or relation embeddings. In this case, the gradients of the embeddings are computed from stale information, which can slow down convergence or lead to a different local minimum.
A naive implementation of KGE training results in low computation-to-memory density for many KGE models, which prevents us from using computation resources efficiently. When performing computation on a batch, we need to move a set of entity and relation embeddings to computation resources (e.g., CPUs and GPUs) from local CPU memory or remote machines. For example, for a mini-batch with $b$ positive triplets, $k$ negative triplets per positive triplet, and $d$-dimensional embeddings, both the computational and the data movement complexity of TransE are $O(bd(k+1))$, resulting in a computational density of $O(1)$. Given that computations are faster than memory accesses, reducing data movement is key to achieving efficient KGE training.
In addition, we need to take advantage of parallel computing resources. This includes multi-core CPUs, GPUs and a cluster of machines. Our training algorithm needs to allow massive parallelization while still minimizing conflicts when updating embeddings in parallel.
In this work, we implement DGL-KE on top of DGL (Wang et al., 2019), entirely in Python. It relies on DGL for graph computation, such as sampling, and on deep learning frameworks, such as Pytorch and MXNet, for tensor operations.
DGL-KE provides a unified implementation for efficient KGE training on different hardware. It optimizes for three types of hardware configurations: (i) many-core CPU machines, (ii) multi-GPU machines, and (iii) a cluster of CPU/GPU machines. In each type of the hardware, DGL-KE parallelizes the training with multiprocessing to fully utilize the parallel computation power of the hardware.
For all hardware configurations, the training process starts with a preprocessing step that partitions the knowledge graph, followed by mini-batch training. The partitioning step assigns a disjoint set of triplets in the knowledge graph to each process so that the process can perform mini-batch training independently.
During each mini-batch, a trainer process performs the following steps:
1. Samples triplets from its local partition to form a mini-batch and constructs negative samples for the mini-batch.
2. Fetches the entity and relation embeddings that are involved in the mini-batch from the global entity and relation embedding tensors.
3. Performs forward computation and back-propagation on the embeddings fetched in the previous step in order to compute the gradients of the embeddings.
4. Applies the gradients to update the embeddings involved in the mini-batch. This step applies an optimization algorithm to adjust the gradients and writes the result back to the global entity and relation embedding tensors.
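The four steps above can be sketched with NumPy as follows. This is a toy, single-process illustration and not DGL-KE code: it assumes a TransE-style squared-L2 score, one corrupted tail per triplet, a margin loss, hand-derived gradients, and plain SGD in place of the optimizer; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_entities, num_relations, d = 50, 5, 8
entity_emb = rng.normal(scale=0.1, size=(num_entities, d))     # global tensors
relation_emb = rng.normal(scale=0.1, size=(num_relations, d))
local_triplets = np.stack([rng.integers(0, num_entities, 200),
                           rng.integers(0, num_relations, 200),
                           rng.integers(0, num_entities, 200)], axis=1)

def train_step(batch_size=16, lr=0.1, gamma=1.0):
    # Step 1: sample positives from the local partition and corrupt their tails.
    batch = local_triplets[rng.integers(0, len(local_triplets), batch_size)]
    h_id, r_id, t_id = batch[:, 0], batch[:, 1], batch[:, 2]
    n_id = rng.integers(0, num_entities, batch_size)
    # Step 2: fetch only the embeddings that the mini-batch touches.
    h, r = entity_emb[h_id], relation_emb[r_id]
    t, n = entity_emb[t_id], entity_emb[n_id]
    # Step 3: forward pass and gradients of the margin loss
    # max(0, gamma + ||h + r - t||^2 - ||h + r - n||^2).
    pos, neg = h + r - t, h + r - n
    loss = gamma + (pos ** 2).sum(1) - (neg ** 2).sum(1)
    active = (loss > 0)[:, None]          # only violated margins contribute
    g_hr = 2 * (pos - neg) * active       # gradient w.r.t. both h and r
    g_t = -2 * pos * active
    g_n = 2 * neg * active
    # Step 4: sparse write-back; np.add.at accumulates over repeated ids.
    np.add.at(entity_emb, h_id, -lr * g_hr)
    np.add.at(entity_emb, t_id, -lr * g_t)
    np.add.at(entity_emb, n_id, -lr * g_n)
    np.add.at(relation_emb, r_id, -lr * g_hr)
    return float(np.maximum(loss, 0).mean())
```

Only the rows indexed by the mini-batch are read and written, which is what makes asynchronous training with sparse updates practical.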
KGE training on a knowledge graph involves two types of data: the knowledge graph structure and the entity and relation embeddings. As illustrated in Figure 1, we use a different data placement for each hardware configuration. In many-core CPU machines, DGL-KE keeps the knowledge graph structure as well as the entity and relation embeddings in shared CPU memory accessible to all processes. A trainer process reads the entity and relation embeddings from the global embeddings directly through shared memory. In multi-GPU machines, DGL-KE keeps the knowledge graph structure and entity embeddings in shared CPU memory because entity embeddings are too large to fit in GPU memory. It may place relation embeddings in GPU memory to reduce data communication. As such, a trainer process reads entity embeddings from CPU shared memory and reads relation embeddings directly from GPU memory. In a cluster of machines, DGL-KE implements a C++-based distributed key-value store (KVStore) to store both entity and relation embeddings. The KVStore partitions the entity embeddings and relation embeddings automatically and distributes them across all KVStore servers. A trainer process accesses embeddings in the distributed KVStore with the pull and push API. We partition the knowledge graph structure, and each trainer machine stores one partition of the graph. The graph structure of the partition is shared among all trainer processes on the machine.
The rest of this section describes various optimization techniques that we developed in DGL-KE: graph partitioning in the preprocessing step (Section 3.2), negative sampling (Section 3.3), data access to relation embeddings (Section 3.4), and finally applying gradients to the global embeddings (Section 3.5).
3.2. Graph partitioning
In distributed training, we partition the graph structure and embeddings and store them across the machines of the cluster. During training, each machine may need to read entity and relation embeddings from other machines to construct mini-batches. The key to optimizing distributed training is to reduce the communication required to retrieve and update entity and relation embeddings.
To reduce the communication caused by entity embeddings in a batch, we apply METIS partitioning (Karypis and Kumar, 1998) to the knowledge graph in the preprocessing step. For a cluster of $P$ machines, we split the graph into $P$ partitions and assign each METIS partition (all of its entities and the triplets incident to them) to one machine, as shown in Figure 2. With METIS partitioning, the majority of the triplets fall in the diagonal blocks of the partitioned adjacency matrix. We co-locate the embeddings of the entities with the triplets in the diagonal block by specifying a proper data partitioning in the distributed KVStore. When a trainer process samples triplets in the local partition, most of the entity embeddings accessed by the batch fall in the local partition and, thus, there is little network communication to access entity embeddings from other machines.
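To see why a min-cut partitioning helps, one can measure the fraction of triplets whose head and tail are stored on the same machine. The sketch below does not call METIS; it only compares two hand-written partition assignments on a toy graph, and the names `local_triplet_fraction`, `min_cut_like` and `interleaved` are hypothetical.

```python
def local_triplet_fraction(triplets, part_of):
    """Fraction of triplets whose head and tail live in the same partition."""
    local = sum(1 for h, _, t in triplets if part_of[h] == part_of[t])
    return local / len(triplets)

# Two tightly connected clusters {0, 1, 2} and {3, 4, 5} joined by a single edge.
triplets = [(0, 0, 1), (1, 0, 2), (2, 1, 0),
            (3, 0, 4), (4, 1, 5), (5, 0, 3),
            (2, 1, 3)]
min_cut_like = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # cuts only the bridge edge
interleaved = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}    # ignores the graph structure

print(local_triplet_fraction(triplets, min_cut_like))
print(local_triplet_fraction(triplets, interleaved))
```

Every triplet that is not local forces at least one remote embedding read and one remote gradient write, so a higher local fraction translates directly into less network traffic.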
3.3. Negative sampling
KGE training samples triplets to form a batch and constructs a large number of negative samples for each triplet in the batch. For all hardware configurations, DGL-KE performs sampling on CPUs and offloads the entire sampling computation to DGL for efficiency. If we construct negative samples independently for each triplet, a mini-batch will involve many distinct entities, which results in accessing many embeddings.
We use joint negative sampling to reduce the number of entities involved in a mini-batch. In this approach, instead of independently corrupting every triplet $k$ times, we group the triplets into sets of size $g$ and corrupt them together. For example, when corrupting the tail entities of a set, we uniformly sample $k$ entities to replace the tail entities of that set. We corrupt the head entities in a similar fashion. This negative sampling strategy has two benefits. First, it reduces the number of entities involved in a mini-batch, resulting in a smaller amount of data access. For a $d$-dimensional embedding, each mini-batch of size $b$ now only needs to access $O(bd + bkd/g)$ instead of $O(bkd)$ words of memory. When $g$ grows as large as $b$, the amount of data accessed by this negative sampling is about $bk/(b+k)$ times smaller, which approaches $k$ times smaller when $b \gg k$ ($b$ is usually in the order of 1000). This benefit is more significant in multi-GPU training because we store entity embeddings in CPU memory and send the entity embeddings to the GPUs in every mini-batch. Second, it allows us to replace the original computation with more efficient tensor operations. Inside a group of negative samples, head entities and tail entities are densely connected. We now divide the computation of a score function on a negative sample into two parts. For example, the score function of TransE_l2, $-\lVert \mathbf{h} + \mathbf{r} - \mathbf{t}' \rVert_2$, is divided into $\mathbf{o} = \mathbf{h} + \mathbf{r}$ and $-\lVert \mathbf{o} - \mathbf{t}' \rVert_2$. The vector $\mathbf{o}$ is computed as before because there are only $g$ pairs of $\mathbf{h}$ and $\mathbf{r}$. The computation of $\lVert \mathbf{o} - \mathbf{t}' \rVert_2$ is converted into a generalized matrix multiplication, which can be performed using highly optimized math libraries. There are $g \times k$ pairs of $\mathbf{o}$ and $\mathbf{t}'$.
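The conversion of the pairwise TransE_l2 scores of one group into a matrix multiplication can be sketched with NumPy as follows, for a group of $g$ positive triplets that share $k$ corrupted tails. It relies on the identity $\lVert \mathbf{o} - \mathbf{t}' \rVert^2 = \lVert \mathbf{o} \rVert^2 - 2\,\mathbf{o}^\top \mathbf{t}' + \lVert \mathbf{t}' \rVert^2$; the function name and the toy sizes are assumptions made for this example.

```python
import numpy as np

def joint_negative_scores(h, r, neg_t):
    """Score every (h_i + r_i, t'_j) pair of one group with TransE_l2.

    h, r:   (g, d) embeddings of the g positive triplets in the group.
    neg_t:  (k, d) embeddings of the k corrupted tails shared by the group.
    Returns a (g, k) matrix holding -||h_i + r_i - t'_j||_2.
    """
    o = h + r                                    # only g vector additions
    # ||o - t'||^2 = ||o||^2 - 2 o.t' + ||t'||^2; the middle term is one GEMM.
    sq = (o ** 2).sum(1)[:, None] - 2.0 * (o @ neg_t.T) + (neg_t ** 2).sum(1)[None, :]
    return -np.sqrt(np.maximum(sq, 0.0))         # clamp tiny negative round-off

rng = np.random.default_rng(0)
g, k, d = 4, 6, 8
h = rng.normal(size=(g, d))
r = rng.normal(size=(g, d))
neg_t = rng.normal(size=(k, d))

scores = joint_negative_scores(h, r, neg_t)
# Reference: score each of the g * k negative triplets one at a time.
naive = np.array([[-np.linalg.norm(h[i] + r[i] - neg_t[j]) for j in range(k)]
                  for i in range(g)])
```

The joint version reads $g + k$ entity vectors to score $g \times k$ negatives, whereas independent corruption would read a fresh vector for every negative.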
We also use non-uniform negative sampling with a probability proportional to the degree of each entity (PBG uses a similar strategy). On a large knowledge graph, uniform negative sampling results in easy negative samples (Kotnis and Nastase, 2017). One way of constructing harder negative samples is to corrupt a triplet with entities sampled proportionally to the entity degree. In order to do this efficiently, instead of sampling entities from the entire graph, we construct negative samples with the entities that are already in the mini-batch. This is done by uniformly sampling some of the mini-batch’s triplets and connecting the sampled head (tail) entities with the tail (head) entities of the mini-batch’s triplets to construct the negative samples. Note that this uniform triplet sampling approach leads to an entity sampling approach that is proportional to the entity degree in the mini-batch. In practice, we combine these negative samples with uniformly sampled negative samples to form the full set of negative samples for a mini-batch.
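The observation that uniform triplet sampling yields degree-proportional entity sampling can be checked with a small simulation. The sketch below covers tail corruption only; the helper name and the toy batch are assumptions made for this example.

```python
import random
from collections import Counter

def in_batch_negative_tails(batch, num_neg, rng):
    """Pick num_neg replacement tails by sampling triplets of the batch uniformly.

    An entity that occurs as the tail of many triplets is picked more often,
    so the draw is proportional to its in-batch degree.
    """
    return [rng.choice(batch)[2] for _ in range(num_neg)]

# Entity 9 is the tail of three triplets, entity 7 of only one.
batch = [(0, 0, 9), (1, 0, 9), (2, 1, 9), (3, 1, 7)]
rng = random.Random(0)
counts = Counter(in_batch_negative_tails(batch, 4000, rng))
print(counts)
```

In this toy batch, entity 9 should be drawn roughly three times as often as entity 7, mirroring their in-batch degrees, and no embedding outside the mini-batch has to be fetched.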
In the distributed training, we sample entities from the local METIS partition to corrupt triplets in a mini-batch to minimize the communication caused by negative samples. This ensures that negative samples do not increase network communication. This strategy in general results in harder negative samples. The corrupted head/tail entities sampled from the local METIS partition are topologically closer to the tail/head entities of the triplets in the batch.
3.4. Relation partitioning
Both GraphVite and PBG treat relation embeddings as dense model weights. As a result, for each mini-batch they incur the cost of retrieving and updating all of them. If the number of relations in the knowledge graph is small, this is close to optimal and does not impact the performance. However, when the knowledge graph has a large number of relations (greater than the mini-batch size), each mini-batch involves only a subset of the relations and, as such, treating them as dense model weights results in unnecessary data access and transfer overheads. To address this limitation, DGL-KE performs sparse relation embedding reads and sparse gradient updates on relation embeddings. This significantly reduces the amount of data transferred in multi-processing, multi-GPU, and distributed training.
To further reduce the amount of access to relation embeddings in a mini-batch, DGL-KE decomposes the computations among the computing units by introducing a novel relation partitioning approach. This relation partitioning tries (i) to equally distribute the triplets and the relations among the partitions and (ii) to minimize the number of distinct relations that are assigned to each partition as a result of (i). The first goal ensures that the computational and memory requirements are balanced across the computing units, whereas the second goal ensures that the relation-related data that needs to be transferred is minimized. In order to derive such a relation partitioning, we use the following fast greedy algorithm. We sort the relations by their frequency in non-increasing order. We then iterate over the sorted relations and greedily assign each relation to the partition with the smallest number of triplets so far. This strategy usually results in a balanced partitioning while ensuring that each relation belongs to only one partition. However, the above algorithm fails to produce a balanced partitioning when the knowledge graph contains relations that are very frequent. In such cases, the number of triplets for those relations may exceed the partition size. To avoid load imbalance, we equally split the most common relations across all partitions. After relation partitioning, we assign each relation partition to a computing unit. This ensures that the majority of relation embeddings are updated by only one process at a time. This optimization applies to many-core CPU training and multi-GPU training.
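The greedy step can be sketched in plain Python as follows. This sketch covers only the basic assignment and omits both the splitting of very frequent relations and the per-epoch randomization; the function name and the tie-breaking rule (lowest relation id, then lowest partition index) are assumptions made for this example.

```python
from collections import Counter

def partition_relations(relation_ids, num_parts):
    """Greedy assignment: most frequent relation first, to the lightest partition."""
    freq = Counter(relation_ids)
    load = [0] * num_parts          # number of triplets assigned to each partition
    assignment = {}
    # Non-increasing frequency; ties are broken by relation id for determinism.
    for rel, count in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(range(num_parts), key=lambda p: load[p])
        assignment[rel] = target
        load[target] += count
    return assignment, load

# One relation id per triplet: relation 0 appears 5 times, relation 4 once.
relation_ids = [0] * 5 + [1] * 4 + [2] * 3 + [3] * 2 + [4] * 1
assignment, load = partition_relations(relation_ids, num_parts=2)
print(assignment, load)
```

Each relation ends up in exactly one partition, so its embedding is only ever updated by the process that owns that partition.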
A potential drawback of relation partitioning is that it restricts the relations that may appear inside a mini-batch. This reduces the randomization of stochastic gradient descent, which can impact the quality of the embeddings. To tackle this problem, we introduce randomization in the partitioning algorithm and at the start of each epoch we compute a somewhat different relation partitioning.
When we use relation partitioning in multi-GPU training, we store all relation embeddings on GPUs and update them locally on the GPUs. This is particularly important for KGE models with large model weights on relations, such as TransR and RESCAL. Take TransR as an example. It has an entity projection matrix for each relation, which is much larger than a relation embedding ($d \times d$ entries instead of $d$). Moving these matrices between CPU and GPU is the bottleneck of the entire computation. If we keep all of the projection matrices on the GPUs, they no longer have to be transferred in every mini-batch, and the communication overhead becomes significantly smaller than in the naive solution.
3.5. Overlap gradient update with batch processing
In multi-GPU training, some of the steps in a mini-batch computation run on CPUs while the others run on GPUs. When we run them in serial in a process, the GPU remains idle when the CPU writes the gradients. To avoid GPU idling, we overlap entity embedding update with the batch computation in the next mini-batch. This allows us to overlap the computation in CPUs and GPUs. Note that even though this approach can potentially increase the staleness of the embeddings used in a mini-batch, the likelihood of that happening is small for knowledge graphs with a sufficiently large number of entities relative to the number of training processes.
To perform this optimization, we split the gradient updates into two parts: one involving the relation embeddings, which are updated by the trainer process, and the other involving the entity embeddings, which are off-loaded to a dedicated gradient update process for each trainer process. Once the trainer process finishes writing the relation gradients, it proceeds to the next mini-batch without having to wait for the writing of the entity gradients to finish. Our experiments show that overlapping gradient updates provides a 40% speedup for most of the KGE models on Freebase.
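The hand-off between a trainer and its dedicated updater can be sketched with a queue. The sketch below uses a Python thread as a stand-in for the separate update process described above, a dictionary as a stand-in for the global entity embedding tensor, and constant gradients; all of these are simplifying assumptions.

```python
import queue
import threading

entity_emb = {i: 0.0 for i in range(4)}   # toy "global" entity table
pending = queue.Queue()

def updater():
    """Dedicated worker: applies entity gradients while the trainer moves on."""
    while True:
        item = pending.get()
        if item is None:                   # sentinel: no more mini-batches
            break
        for ent, grad in item.items():
            entity_emb[ent] -= 0.5 * grad  # plain SGD step with lr = 0.5

worker = threading.Thread(target=updater)
worker.start()
for step in range(3):
    # A real trainer would compute gradients here; this toy uses constants.
    entity_grads = {step: 1.0, step + 1: 1.0}
    pending.put(entity_grads)              # hand off and continue immediately
pending.put(None)
worker.join()
print(entity_emb)
```

Because the trainer never waits for the write-back, a later mini-batch may read an embedding before an earlier update has landed, which is the staleness trade-off discussed above.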
3.6. Other optimizations
Periodic synchronization among processes
When KGE models are trained with completely independent processes, different processes may run at different rates, which results in inconsistent model accuracy. We observe that the trained embeddings have much worse accuracy in some runs. As such, we add a synchronization barrier among all training processes after a certain number of batches to ensure that all processes train at roughly the same rate. Our observation is that the model can be trained stably if processes synchronize after every few thousand batches.
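A periodic barrier of this kind can be sketched with the Python standard library. The example uses threads instead of processes and a very short synchronization interval so that it runs instantly; the worker count, batch count and interval are arbitrary assumptions.

```python
import threading

num_workers, num_batches, sync_every = 3, 10, 5
barrier = threading.Barrier(num_workers)
progress = [0] * num_workers
checks = []
lock = threading.Lock()

def trainer(rank):
    for step in range(1, num_batches + 1):
        progress[rank] = step            # stand-in for training one mini-batch
        if step % sync_every == 0:
            barrier.wait()               # block until every trainer reaches this step
            with lock:
                # After the barrier, no trainer can be behind the current step.
                checks.append(min(progress) >= step)

threads = [threading.Thread(target=trainer, args=(i,)) for i in range(num_workers)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(all(checks), progress)
```

Between two barriers the trainers still run fully asynchronously, so the barrier bounds how far apart they can drift without serializing the mini-batches.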
Distributed Key-Value store
In DGL-KE, we implement a KVStore for model synchronization with an efficient C++ back-end. It uses several optimizations that are designed specifically for distributed KGE training. First, because the relations in some knowledge graphs have a long-tail distribution, it reshuffles the relation embeddings in order to avoid a single hot spot in the KVStore. Second, DGL-KE uses local shared-memory access instead of network communication if the worker processes and KVStore processes are on the same machine. This optimization can significantly reduce networking overhead, especially with METIS graph partitioning. Third, it launches multiple KVStore servers on a single machine to parallelize the computation in the KVStore. All KVStore servers on a machine share embeddings via local shared memory. Finally, similar to the optimization we use in multi-GPU training, the gradient communication is overlapped with the local gradient computation in the KVStore.
4. Related Work
There are a few packages that have been developed to compute embeddings of knowledge graphs efficiently and scale to large knowledge graphs.
OpenKE (Han et al., 2018) is one of the first packages for training knowledge graph embeddings and provides a large list of models. However, it is implemented entirely in Python and cannot scale to very large graphs.
Pytorch-BigGraph (PBG) (Lerer et al., 2019) is developed with an emphasis on scalability to large graphs and distributed training on a cluster of machines. The package does not support GPU training. Although PBG and DGL-KE share similar negative sampling strategies, PBG applies different strategies for distributed training. It randomly divides the adjacency matrix of the graph into 2D blocks and assigns blocks to each machine based on a schedule that avoids conflicts with respect to the entity embeddings. It treats entity embeddings as sparse model weights and relation embeddings as dense model weights. The random 2D partitioning along with the use of dense model weights for relation embeddings results in a large amount of communication, especially for knowledge graphs with many relations.
GraphVite (Zhu et al., 2019) focuses on multi-GPU training and does not support distributed training. When it trains a large knowledge graph, it keeps embeddings on CPU memory. It constructs a subgraph, moves all data in the subgraph to the GPU memory and performs many mini-batch training steps on the subgraph. This method reduces data movement between CPUs and GPUs at the cost of increasing the staleness of the embeddings, which usually results in slower convergence.
5. Experimental Methodology
DGL-KE is implemented in Python and relies on PyTorch for tensor operations, as is the case in PBG, whereas GraphVite is implemented mostly in C++ with a Python wrapper. We report DGL-KE performance in two broad parts: (i) on its own, for multi-GPU training in Section 6.1, many-core CPU training in Section 6.2, and distributed training in Section 6.3; and (ii) against GraphVite (Zhu et al., 2019) and PBG (Lerer et al., 2019) in Section 6.4 on identical hardware.
5.1. Hardware platform
We conduct all evaluations, including those of GraphVite and PBG, on EC2 CPU and GPU instances; see Table 2 for machine configurations.
|EC2 Type||Hardware Config||Eval Section|
|r5dn.24xlarge||2x24 cores, 700GB RAM, 100Gbps network||sec 6.2, 6.3|
|p3.16xlarge||2x16 cores, 500GB RAM, 8 V100 GPUs||sec 6.1|
We used three datasets to evaluate and compare the performance of DGL-KE against that of GraphVite and PBG. Table 3 shows various statistics for these datasets. FB15k and Freebase were derived from the Freebase Knowledge Graph (Bollacker et al., 2008), whereas WN18 was derived from WordNet (Miller, 1995). The FB15k and WN18 datasets are standard benchmarks for evaluating KGE methods. The Freebase dataset corresponds to the complete Freebase Knowledge Graph. All datasets were downloaded from (7).
5.3. Evaluation methodology
We evaluated the performance of the different KGE models and methods using a link (relation)-prediction task. In order to train and evaluate the models, we split each dataset into training, validation, and test subsets. For FB15k and WN18, we used the same splits that were used in previous evaluations (Sun et al., 2019) (available in (7)). Freebase is split with 5% of the triplets for validation, 5% for test, and the remaining 90% for training (also available in (7)).
We performed the link-prediction task using two different protocols. The first, which was used for FB15k and WN18, works as follows. For each triplet $(h, r, t)$ in the validation/test set, referred to as a positive triplet, we generated all possible triplets of the form $(h', r, t)$ and $(h, r, t')$ by corrupting the head and tail entities. We then removed from them any triplets that already exist in the dataset. The triplets that remained form the negative triplets associated with the initial positive triplet. We then used the score function of the model in question (Table 1) to score the triplets. The second protocol, which was used for Freebase, is similar to the first one with the following two differences: (i) we use only 2000 negative triplets, 1000 sampled uniformly from the entire set of negative samples and 1000 sampled proportionally to the degree of the corrupted entities; and (ii) we did not remove from the 2000 negative triplets any triplets that are in the dataset. The second protocol was needed because the size of Freebase made the first protocol computationally expensive.
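The first (filtered) protocol can be sketched for tail corruption as follows. The helper name, the one-dimensional toy score function, and the optimistic handling of ties (a negative with exactly the same score does not push the positive down) are assumptions made for this example.

```python
def filtered_tail_rank(triplet, score_fn, num_entities, known_triplets):
    """1-based rank of the true tail among all corrupted tails not in the dataset."""
    h, r, t = triplet
    true_score = score_fn(h, r, t)
    rank = 1
    for e in range(num_entities):
        if e == t or (h, r, e) in known_triplets:
            continue                      # skip the positive itself and known facts
        if score_fn(h, r, e) > true_score:
            rank += 1                     # a negative outscores the positive
    return rank

# Toy score: entities and relations are integers, a perfect triplet has h + r == t.
toy_score = lambda h, r, t: -abs(h + r - t)

unfiltered = filtered_tail_rank((1, 2, 4), toy_score, 6, known_triplets=set())
filtered = filtered_tail_rank((1, 2, 4), toy_score, 6, known_triplets={(1, 2, 3)})
print(unfiltered, filtered)
```

In this toy case the only candidate that outscores the positive is itself a known triplet, so removing it improves the rank of the positive by one position.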
We assessed the performance by using the standard metrics (Lerer et al., 2019) of Hit@$k$ (for $k \in \{1, 3, 10\}$), Mean Rank (MR), and Mean Reciprocal Rank (MRR). All these metrics are derived by comparing how the score of the positive triplet relates to the scores of its associated negative instances. For a positive triplet $i$, let $S_i$ be the list of triplets containing $i$ and its associated negative triplets ordered in a non-increasing score order, and let $\mathit{rank}_i$ be the position of $i$ in $S_i$. Given that, Hit@$k$ is the fraction of positive triplets that are among the $k$ highest ranked triplets; MR is the average rank of the positive instances, whereas MRR is the average reciprocal rank of the positive instances. Mathematically, they are defined as
$$\text{Hit@}k = \frac{1}{Q} \sum_{i=1}^{Q} \mathbb{1}_{\mathit{rank}_i \le k}, \qquad \text{MR} = \frac{1}{Q} \sum_{i=1}^{Q} \mathit{rank}_i, \qquad \text{MRR} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{\mathit{rank}_i},$$
where $Q$ is the total number of positive triplets and $\mathbb{1}_{\mathit{rank}_i \le k}$ is 1 if $\mathit{rank}_i \le k$, otherwise it is 0. Note that Hit@$k$ and MRR are between 0 and 1, whereas MR ranges from 1 to $|S_i|$.
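Given the 1-based rank of each positive triplet within its candidate list, the three metrics reduce to a few lines of Python. The function name and the example ranks are assumptions made for illustration.

```python
def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute Hit@k, MR and MRR from the 1-based ranks of the positive triplets."""
    q = len(ranks)
    metrics = {f"Hit@{k}": sum(1 for x in ranks if x <= k) / q for k in ks}
    metrics["MR"] = sum(ranks) / q
    metrics["MRR"] = sum(1.0 / x for x in ranks) / q
    return metrics

# Four positive triplets ranked 1st, 2nd, 4th and 20th among their candidates.
metrics = ranking_metrics([1, 2, 4, 20])
print(metrics)
```

The example shows why the metrics are reported together: the single poorly ranked triplet dominates MR (6.75) but has little effect on MRR (0.45) or Hit@10 (0.75).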
5.4. Software environment
We run Ubuntu 18.04 on all EC2 instances, where the Python version is 3.6.8 and Pytorch version is 1.3.1. On GPU instances, the CUDA version is 10.0. When comparing the performance of DGL-KE against that of GraphVite and PBG, we use GraphVite v0.2.1 downloaded from Github on November 12 2019 and PBG downloaded from their Github repository on October 15 2019. All frameworks use the same Pytorch version.
For FB15k and WN18, and for all methods (DGL-KE and GraphVite), we performed an extensive hyper-parameter search and report the results that achieve the best performance in terms of MRR, as we believe it is a good measure of the overall performance of the methods. Due to the size of Freebase, we only report results for a single set of hyper-parameter values: for Freebase, we use the hyper-parameters that perform best on FB15k.
To ensure that the accuracy results are comparable, all methods used exactly the same test set and evaluation protocols described in the previous section.
6.1. Multi-GPU training
Both memory and computing capacity on a multi-GPU machine have a diverse set of characteristics, which make the various optimizations described in Sections 3.3–3.6 relevant. A detailed evaluation of these optimizations follows.
6.1.1. Negative sampling
Joint negative sampling, described in Section 3.3, has two effects: (i) it enables more efficient tensor operators and (ii) it reduces data movement in multi-GPU training. Figure 3 shows the result. To illustrate the speedup from more efficient tensor operators, we run the TransE model on FB15k with all data on a single GPU; joint negative sampling gives a speedup in this setting (Figure 3). To illustrate the speedup from reducing data movement, we run the TransE model on FB15k on 8 GPUs, where the entity embeddings are stored in CPU memory. Here joint negative sampling gets a much larger speedup, because naive sampling requires swapping many more entity embeddings between CPU and GPU than joint negative sampling, so data communication becomes the bottleneck.
6.1.2. Degree-based negative sampling
Although degree-based negative sampling does not speed up training, it improves the model accuracy (Table 4) on Freebase. This suggests that non-uniform negative sampling to generate “hard” negative samples is effective, especially on large knowledge graphs.
6.1.3. Overlap gradient update with batch computation
This technique overlaps the computation on GPUs and CPUs to speed up the training. Figure 4 shows the speedup of using this technique (comparing sync and async) on FB15k and Freebase. It gives a limited speedup on small knowledge graphs for some models, but a speedup of roughly 40% on Freebase for almost all models. The effectiveness of this optimization depends on the computation time on CPUs and GPUs. Large knowledge graphs, such as Freebase, require hundreds of GBytes to store the entire entity embeddings and suffer from slow random memory access during entity embedding updates. In this case, overlapping the CPU and GPU computation is highly beneficial.
6.1.4. Relation partitioning
After relation partitioning, we pin the relation embeddings (and projection matrices) of each partition to a specific GPU, which reduces data movement between CPUs and GPUs. The speedup depends strongly on the model size and the number of relations in the dataset. Figure 4 shows the speedup from relation partitioning in multi-GPU training (comparing the async and async + rel_part bars) on FB15k and Freebase. For example, relation partitioning yields a significant speedup on TransR because the relation-specific projection matrices result in a large amount of data communication between CPU and GPU. Even for models with only relation embeddings, relation partitioning generally yields a speedup of over 10%.
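One simple way to obtain such a partitioning is a greedy load-balancing pass over the relations, sketched below. The greedy rule (most frequent relation first, assigned to the currently lightest partition) and the function name are assumptions for illustration; the sketch does not handle relations that are too frequent to fit in a single partition.

```python
from collections import Counter

def partition_relations(triples, num_parts):
    """Assign every relation to exactly one partition, balancing triple counts."""
    freq = Counter(rel for _, rel, _ in triples)
    loads = [0] * num_parts
    assignment = {}
    # Most frequent relations first; ties are broken by relation name.
    for rel, count in sorted(freq.items(), key=lambda item: (-item[1], item[0])):
        part = min(range(num_parts), key=lambda p: loads[p])
        assignment[rel] = part
        loads[part] += count
    return assignment, loads

triples = ([(0, "a", 1)] * 5 + [(1, "b", 2)] * 3 +
           [(2, "c", 3)] * 2 + [(3, "d", 4)] * 2)
assignment, loads = partition_relations(triples, num_parts=2)
print(assignment, loads)
```

Because each relation belongs to a single partition, its embedding (and projection matrix, if any) can stay resident on one GPU and is updated by only one process.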
6.1.5. Overall speed and accuracy
After deploying all of the optimizations evaluated above, we measure the speedup of DGL-KE with multiple GPUs on both FB15k and Freebase. Figure 5 shows that DGL-KE accelerates training almost linearly with multiple GPUs. On Freebase, DGL-KE further speeds up by running 16 processes on 8 GPUs. By running two processes on each GPU, we better utilize the computation in GPUs and PCIe buses by overlapping computation and data communication between CPUs and GPUs.
With all these techniques, we train KGE models efficiently. For small knowledge graphs, such as FB15k, DGL-KE trains most KGE models, even ones as complex as RotatE and TransR, within a few minutes. For large knowledge graphs, such as Freebase, DGL-KE trains many KGE models in about one or two hours and trains more complex models within a reasonable time; for example, we train TransR in about 8 hours using 8 GPUs.
With a maximum speedup of with single-GPU training, we sacrifice little on accuracy. Table 5 and Table 6 show the accuracy of DGL-KE with 1 and 8 GPUs on FB15k and Freebase. The 1GPU columns show the baseline accuracy and the Fastest columns show the accuracy with the fastest configuration on 8 GPUs. For FB15k, we achieve the fastest training speed with 8 processes on 8 GPUs, while for Freebase, we use 8 GPUs and 16 concurrent processes. In all experiments, the total number of epochs is the same for the 1GPU and Fastest settings. Here, we only show TransR with 8 GPUs on Freebase because training TransR on one GPU takes a very long time.
6.2. Many-core training
6.3. Distributed training
In distributed training, we use 4 r5dn.24xlarge EC2 instances as our cluster environment. In this section, we compare the baseline single-machine training with distributed training using both random partitioning and METIS partitioning on Freebase.
METIS partitioning in distributed training gets nearly speedup compared with the single-machine baseline (Figure 7) without sacrificing any model accuracy (Table 7). Training with METIS partitioning is about 20% faster than training with random partitioning because METIS partitioning leads to much lower communication overhead than random partitioning.
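The effect of the partitioning on communication can be illustrated by counting cut edges, i.e., edges whose endpoints are placed on different machines. The toy graph and the two hand-written assignments below are assumptions for illustration: the locality-aware assignment mimics what a min-cut partitioner aims for, and the interleaved assignment is a deterministic stand-in for a structure-agnostic (e.g., random) partitioning. The sketch does not call METIS.

```python
def cut_fraction(edges, part_of):
    """Fraction of edges whose endpoints sit on different machines."""
    cut = sum(1 for u, v in edges if part_of[u] != part_of[v])
    return cut / len(edges)

# Two dense 4-node clusters joined by a single bridge edge (3, 4).
cluster_a = [(u, v) for u in range(4) for v in range(u + 1, 4)]
cluster_b = [(u, v) for u in range(4, 8) for v in range(u + 1, 8)]
edges = cluster_a + cluster_b + [(3, 4)]

locality_aware = {n: 0 if n < 4 else 1 for n in range(8)}  # follows the clusters
interleaved = {n: n % 2 for n in range(8)}                 # ignores the structure

print("locality-aware cut:", cut_fraction(edges, locality_aware))
print("interleaved cut:   ", cut_fraction(edges, interleaved))
```

Every cut edge corresponds to a triple whose entity embeddings live on different machines, so a smaller cut fraction means fewer remote embedding accesses per mini-batch.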
6.4. Overall performance
We evaluate DGL-KE on the datasets in Table 3 and compare it with two existing packages, GraphVite and PBG, on both CPUs and GPUs. Because GraphVite and PBG provide only a subset of the models in DGL-KE, we compare only on the models available in these two packages.
6.4.1. Comparison with GraphVite
DGL-KE is consistently faster than GraphVite on both FB15k and WN18 (Figure 9 and Figure 10) when training all KGE models to reach similar accuracy (Table 8 and Table 9). For most of the models, DGL-KE is faster than GraphVite. This is mainly because DGL-KE converges faster than GraphVite. In all cases, DGL-KE needs fewer than 100 epochs to converge, whereas GraphVite needs thousands of epochs. When evaluating GraphVite, we use the configuration recommended by the package for each algorithm when running on 1 GPU and 4 GPUs, and tune some hyperparameters for 8 GPUs to get results comparable with the 1 GPU runs. When evaluating DGL-KE, we use the same entity and relation embedding dimensions as GraphVite, but tune hyperparameters such as learning rate, negative sample size, and batch size for better accuracy.
6.4.2. Comparison with PBG
DGL-KE runs twice as fast as PBG when training KGE models on Freebase (Figure 8). Many factors contribute to the slower training speed of PBG. One of the major factors is that PBG handles relation embeddings as dense model weights. As such, the computation in a batch involves all relation embeddings in the graph, which is 10 times more than necessary on Freebase. In contrast, DGL-KE reduces the number of relation embeddings involved in a batch and thus significantly reduces the amount of computation and data movement.
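The difference between the two treatments of relation embeddings can be made concrete by counting how many relation embeddings one batch involves. The sketch below is an illustration only: the function name and the toy batch are hypothetical, and it models neither package's actual code.

```python
def relations_involved(batch, num_relations, dense):
    """Number of relation embeddings that take part in one batch."""
    if dense:
        # Dense model weights: every relation embedding is part of the update.
        return num_relations
    # Sparse handling: only relations that occur in the batch are involved.
    return len({rel for _, rel, _ in batch})

num_relations = 1_000
batch = [(0, 3, 1), (2, 3, 5), (4, 7, 6), (8, 7, 9), (1, 42, 0)]

dense_count = relations_involved(batch, num_relations, dense=True)
sparse_count = relations_involved(batch, num_relations, dense=False)
print(dense_count, sparse_count)
```

The sparse count is bounded by the batch size, so it stays small even when the graph has many relation types.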
We develop an efficient package called DGL-KE to train knowledge graph embeddings at large scale. It implements a number of optimization techniques that improve locality and reduce data communication while harnessing parallel computing capacity. As a result, DGL-KE significantly outperforms the state-of-the-art packages for knowledge graph embeddings on a variety of hardware, including many-core CPUs, multi-GPU machines, and clusters of machines. Our experiments show that DGL-KE scales almost linearly with machine resources while still achieving very high model accuracy. DGL-KE is available at https://github.com/awslabs/dgl-ke.
We thank the RotatE authors for making their knowledge graph embedding package KnowledgeGraphEmbedding open-source. DGL-KE was built based on their package.
- Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08.
- Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26.
- MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274.
- Graph embedding techniques, applications, and performance: a survey. Knowledge-Based Systems 151, pp. 78–94.
- OpenKE: an open toolkit for knowledge embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium.
- A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20 (1).
- Knowledge graph datasets in OpenKE. 2019 (accessed August 3, 2019).
- Analysis of the impact of negative sampling on link prediction in knowledge graphs.
- PyTorch-BigGraph: A large-scale graph embedding system. CoRR abs/1903.12287.
- Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.
- WordNet: a lexical database for English. Communications of the ACM 38 (11).
- A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11.
- Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24.
- RotatE: knowledge graph embedding by relational rotation in complex space. CoRR abs/1902.10197.
- Complex embeddings for simple link prediction. CoRR abs/1606.06357.
- Deep graph library: towards efficient and scalable deep learning on graphs.
- Knowledge graph embedding: a survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29 (12), pp. 2724–2743.
- Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of the International Conference on Learning Representations (ICLR) 2015.
- GraphVite: A high-performance CPU-GPU hybrid system for node embedding. CoRR abs/1903.00757.