The number and size of knowledge graphs (KGs) available on the Web and in companies grow steadily (see https://lod-cloud.net/). For example, more than 150 billion facts describing more than 3 billion things are available in the more than 10,000 knowledge graphs published on the Web as Linked Data (see lodstats.aksw.org). Knowledge graph embedding (KGE) approaches aim to map the entities contained in knowledge graphs to $n$-dimensional vectors [19, 13, 22]. Accordingly, they parallel word embeddings from the field of natural language processing [11, 14] and the improvement they brought about in various tasks (e.g., word analogy, question answering, named entity recognition and relation extraction). Applications of KGEs include collective machine learning, type prediction, link prediction, entity resolution, knowledge graph completion and question answering [13, 2, 12, 19, 22, 15].
In this work, we focus on type prediction. We present a novel approach for KGE based on a physical model, which goes beyond the state of the art (see [19] for a survey) w.r.t. both efficiency and effectiveness. Our approach, dubbed Pyke, combines a physical model (based on Hooke’s law) with an optimization technique inspired by simulated annealing. Pyke scales to large KGs by achieving a linear space complexity while being close to linear in its time complexity. We compare the performance of Pyke with that of six state-of-the-art approaches, namely Word2Vec [11], ComplEx [18], RESCAL [13], TransE [2], DistMult [22] and Canonical Polyadic (CP) decomposition [6], on two tasks, i.e., clustering and type prediction, w.r.t. both runtime and prediction accuracy. Our results corroborate our formal analysis of Pyke and suggest that our approach scales close to linearly with the size of the input graph w.r.t. its runtime. In addition to outperforming the state of the art w.r.t. runtime, Pyke also achieves better cluster purity and type prediction scores.
The rest of this paper is structured as follows: after providing a brief overview of related work in Section 2, we present the mathematical framework underlying Pyke in Section 3. Thereafter, we present Pyke in Section 4. Section 5 presents the space and time complexity of Pyke. We report on the results of our experimental evaluation in Section 6. Finally, we conclude with a discussion and an outlook on future work in Section 7.
2 Related Work
A large number of KGE approaches have been developed in the recent past to address tasks such as link prediction, graph completion and question answering [7, 8, 12, 13, 18]. In the following, we give a brief overview of some of these approaches. More details can be found in the survey [19]. RESCAL [13] is based on computing a three-way factorization of an adjacency tensor representing the input KG. The adjacency tensor is decomposed into a product of a core tensor and embedding matrices. RESCAL captures rich interactions in the input KG but is limited in its scalability. HolE [12] uses circular correlation as its compositional operator. Holographic embeddings of knowledge graphs yield state-of-the-art results on the link prediction task while keeping the memory complexity lower than RESCAL and TransR [8]. ComplEx [18] is a KGE model based on latent factorization, wherein complex-valued embeddings are utilized to handle a large variety of binary relations including symmetric and antisymmetric relations.
Energy-based KGE models [1, 2, 3] yield competitive performances on link prediction, graph completion and entity resolution. SE [3] proposes to learn one low-dimensional vector ($\mathbb{R}^k$) for each entity and two matrices ($R_1 \in \mathbb{R}^{k \times k}$, $R_2 \in \mathbb{R}^{k \times k}$) for each relation. Hence, for a given triple $(h, r, t)$, SE aims to minimize the distance, i.e., $f_r(h, t) = \lVert R_1 h - R_2 t \rVert$. The approach in [1] embeds entities and relations into the same embedding space and suggests capturing correlations between entities and relations by using multiple matrix products. TransE [2] is a scalable energy-based KGE model wherein a relation $r$ between entities $h$ and $t$ corresponds to a translation of their embeddings, i.e., $h + r \approx t$ provided that $(h, r, t)$ exists in the KG. TransE outperforms state-of-the-art models in the link prediction task on several benchmark KG datasets while being able to deal with KGs containing up to 17 million facts. DistMult [22] proposes to generalize neural-embedding models under a unified learning framework, wherein relations are bi-linear or linear mapping functions between embeddings of entities.
With Pyke, we propose a different take on generating embeddings by combining a physical model with simulated annealing. Our evaluation suggests that this simulation-based approach to generating embeddings scales well (i.e., linearly in the size of the KG) while outperforming the state of the art in the type prediction and clustering quality tasks [21, 20].
3 Preliminaries and Notation
In this section, we present the core notation and terminology used throughout this paper. The symbols we use and their meaning are summarized in Table 1.
3.1 Knowledge Graph
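In this work, we compute embeddings for RDF KGs. Let $\mathcal{R}$ be the set of all RDF resources, $\mathcal{B}$ be the set of all RDF blank nodes, $\mathcal{P}$ be the set of all properties and $\mathcal{L}$ denote the set of all RDF literals. An RDF KG $\mathcal{K}$ is a set of RDF triples $(s, p, o)$ where $s \in \mathcal{R} \cup \mathcal{B}$, $p \in \mathcal{P}$ and $o \in \mathcal{R} \cup \mathcal{B} \cup \mathcal{L}$. We aim to compute embeddings for resources and blank nodes. Hence, we define the vocabulary of an RDF knowledge graph $\mathcal{K}$ as $\mathcal{V} = \{x : x \in \mathcal{R} \cup \mathcal{P} \cup \mathcal{B} \wedge \exists (s, p, o) \in \mathcal{K} : x \in \{s, p, o\}\}$. Essentially, $\mathcal{V}$ stands for all the URIs and blank nodes found in $\mathcal{K}$. Finally, we define the subjects with type information of $\mathcal{K}$ as $\mathcal{S} = \{x : x \in \mathcal{V} \wedge \exists o : (x, \text{rdf:type}, o) \in \mathcal{K}\}$, where rdf:type stands for the instantiation relation in RDF.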
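The two sets defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are plain string tuples, literals are recognised by a leading double quote, and the helper names `vocabulary` and `typed_subjects` are hypothetical rather than part of any Pyke implementation.

```python
RDF_TYPE = "rdf:type"  # assumed string form of the instantiation relation


def is_literal(term):
    # Assumption for this sketch: literals are serialised with a leading quote.
    return term.startswith('"')


def vocabulary(triples):
    """Collect every URI, property and blank node that occurs in some triple."""
    vocab = set()
    for s, p, o in triples:
        vocab.update((s, p))
        if not is_literal(o):
            vocab.add(o)
    return vocab


def typed_subjects(triples):
    """Collect the subjects that carry at least one rdf:type statement."""
    return {s for s, p, _ in triples if p == RDF_TYPE}


triples = [
    ("ex:a", "rdf:type", "ex:Person"),
    ("ex:a", "ex:name", '"Alice"'),
]
vocab = vocabulary(triples)
typed = typed_subjects(triples)
```

Literals are left out of the vocabulary because embeddings are only computed for resources and blank nodes.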
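Table 1: Symbols and their meaning.
|Notation|Description|
|$\mathcal{K}$|An RDF knowledge graph|
|$\mathcal{R}, \mathcal{P}, \mathcal{B}, \mathcal{L}$|Set of all RDF resources, predicates, blank nodes and literals respectively|
|$\mathcal{S}$|Set of all RDF subjects with type information|
|$\sigma$|Similarity function on $\mathcal{V}$|
|$\vec{x}_t$|Embedding of $x$ at time $t$|
|$F_a, F_r$|Attractive and repulsive forces, respectively|
|$K$|Threshold for positive and negative examples|
|$P$|Function mapping each $x \in \mathcal{V}$ to a set of attracting elements of $\mathcal{V}$|
|$N$|Function mapping each $x \in \mathcal{V}$ to a set of repulsive elements of $\mathcal{V}$|
|$\epsilon$|Upper bound on alteration of locations of $x \in \mathcal{V}$ across two iterations|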
3.2 Hooke’s Law
Hooke’s law describes the relation between a deforming force on a spring and the magnitude of the deformation within the elastic regime of said spring. The increase of a deforming force on the spring is linearly related to the increase of the magnitude of the corresponding deformation. In equation form, Hooke’s law can be expressed as follows:
$$F = -k \, \Delta$$
where $F$ is the deforming force, $\Delta$ is the magnitude of deformation and $k$ is the spring constant. Let us assume two points of unit mass located at $\vec{x}$ and $\vec{y}$ respectively. We assume that the two points are connected by an ideal spring with a spring constant $k$, an infinite elastic regime and an initial length of 0. Then, the force they are subjected to has a magnitude of $k \lVert \vec{x} - \vec{y} \rVert$. Note that the magnitude of this force grows with the distance between the two mass points.
The inverse of Hooke’s law, where
$$F = -\frac{k}{\Delta}$$
has the opposite behavior. It becomes weaker with the distance between the two mass points it connects.
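The contrast between the two laws can be checked numerically. The following is a minimal sketch, assuming Euclidean distance and a spring of rest length 0; the function names are illustrative only.

```python
import math


def hooke_magnitude(x, y, k=1.0):
    """Force magnitude of an ideal spring of rest length 0: grows with distance."""
    return k * math.dist(x, y)


def inverse_hooke_magnitude(x, y, k=1.0):
    """Force magnitude under the inverse law: shrinks with distance."""
    distance = math.dist(x, y)
    if distance == 0.0:
        raise ValueError("the inverse law is undefined for coinciding points")
    return k / distance


near, far = (1.0, 0.0), (3.0, 4.0)
origin = (0.0, 0.0)
spring_far = hooke_magnitude(origin, far, k=2.0)
inverse_far = inverse_hooke_magnitude(origin, far, k=2.0)
```

For the point at distance 5 and $k = 2$, the spring force has magnitude 10 while the inverse law yields 0.4.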
3.3 Positive Pointwise Mutual Information
The Positive Pointwise Mutual Information (PPMI) is a means to capture the strength of the association between two events (e.g., appearing in a triple of a KG). Let $a$ and $b$ be two events. Let $\Pr(a, b)$ stand for the joint probability of $a$ and $b$, $\Pr(a)$ for the probability of $a$ and $\Pr(b)$ for the probability of $b$. Then, $PPMI(a, b)$ is defined as
$$PPMI(a, b) = \max\left(0, \log \frac{\Pr(a, b)}{\Pr(a)\,\Pr(b)}\right)$$
The equation truncates all negative values to 0 as measuring the strength of dissociation between events accurately demands very large sample sizes, which are empirically seldom available.
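As a small worked example, the definition can be written as a Python function. The natural logarithm is an assumption of this sketch; a different base only rescales the values.

```python
import math


def ppmi(p_ab, p_a, p_b):
    """Positive pointwise mutual information of two events.

    Negative associations (and events that never co-occur) are mapped to 0.
    """
    if p_ab == 0.0:
        return 0.0
    return max(0.0, math.log(p_ab / (p_a * p_b)))


independent = ppmi(0.25, 0.5, 0.5)   # co-occur exactly as often as chance predicts
associated = ppmi(0.5, 0.5, 0.5)     # co-occur twice as often as chance predicts
dissociated = ppmi(0.1, 0.5, 0.5)    # co-occur less often than chance predicts
```

Only the second pair receives a positive score; the independent and the dissociated pair are both truncated to 0.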
4 Pyke
In this section, we introduce our novel KGE approach dubbed Pyke (a physical model for knowledge graph embeddings). Section 4.1 presents the intuition behind our model. In Section 4.2, we give an overview of the Pyke framework, starting from processing the input KG to learning embeddings for the input in a vector space with a predefined number of dimensions. The workflow of our model is further elucidated using the running example shown in Figure 1.
4.1 Intuition
Pyke is an iterative approach that aims to represent each element $x$ of the vocabulary $\mathcal{V}$ of an input KG $\mathcal{K}$ as an embedding (i.e., a vector) in the $n$-dimensional space $\mathbb{R}^n$. Our approach begins by assuming that each element of $\mathcal{V}$ is mapped to a single point (i.e., its embedding) of unit mass whose location can be expressed via an $n$-dimensional vector in $\mathbb{R}^n$ according to an initial (e.g., random) distribution at iteration $t = 0$. In the following, we will use $\vec{x}_t$ to denote the embedding of $x \in \mathcal{V}$ at iteration $t$. We also assume a similarity function $\sigma$ (e.g., a PPMI-based similarity) over $\mathcal{V}$ to be given. Simply put, our goal is to improve this initial distribution iteratively over a predefined maximal number of iterations (denoted $T$) by ensuring that
(1) the embeddings of similar elements of $\mathcal{V}$ are close to each other while
(2) the embeddings of dissimilar elements of $\mathcal{V}$ are distant from each other.
Let $d(\vec{x}_t, \vec{y}_t)$ be the distance (e.g., the Euclidean distance) between two embeddings in $\mathbb{R}^n$. According to our goal definition, a good iterative embedding approach should have the following characteristics:
C1: If $\sigma(x, y) > 0$, then $d(\vec{x}_t, \vec{y}_t) \leq d(\vec{x}_{t-1}, \vec{y}_{t-1})$. This means that the embeddings of similar terms should become more similar with the number of iterations. The same holds the other way around:
C2: If $\sigma(x, y) = 0$, then $d(\vec{x}_t, \vec{y}_t) \geq d(\vec{x}_{t-1}, \vec{y}_{t-1})$.
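We translate the first characteristic into our model as follows: If $x$ and $y$ are similar (i.e., if $\sigma(x, y) > 0$), then a force $F_a(\vec{x}_t, \vec{y}_t)$ of attraction must exist between the masses which stand for $x$ and $y$ at any time $t$. $F_a(\vec{x}_t, \vec{y}_t)$ must be proportional to $d(\vec{x}_t, \vec{y}_t)$, i.e., the attraction between the two masses must grow with the distance between $\vec{x}_t$ and $\vec{y}_t$. These conditions are fulfilled by setting the following force of attraction between the two masses:
$$\lVert F_a(\vec{x}_t, \vec{y}_t) \rVert = \sigma(x, y) \times d(\vec{x}_t, \vec{y}_t)$$
From the perspective of a physical model, this is equivalent to placing a spring with a spring constant of $\sigma(x, y)$ between the unit masses which stand for $x$ and $y$. At time $t$, these masses are hence accelerated towards each other with a total acceleration proportional to $\lVert F_a(\vec{x}_t, \vec{y}_t) \rVert$.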
The translation of the second characteristic into a physical model is as follows: If $x$ and $y$ are not similar (i.e., if $\sigma(x, y) = 0$), we assume that they are dissimilar. Correspondingly, their embeddings should diverge with time. The magnitude of the repulsive force between the two masses representing $x$ and $y$ should be strong if the masses are close to each other and should diminish with the distance between the two masses. We can fulfill this condition by setting the following repulsive force between the two masses:
$$\lVert F_r(\vec{x}_t, \vec{y}_t) \rVert = \frac{\omega}{d(\vec{x}_t, \vec{y}_t)}$$
where $\omega > 0$ denotes a constant, which we dub the repulsive constant. At iteration $t$, the embeddings of dissimilar terms are hence accelerated away from each other with a total acceleration proportional to $\lVert F_r(\vec{x}_t, \vec{y}_t) \rVert$. This is the inverse of Hooke’s law, where the magnitude of the repulsive force between the mass points which stand for two dissimilar terms decreases with the distance between the two mass points.
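Based on these intuitions, we can now formulate the goal of Pyke formally: We aim to find embeddings for all elements of $\mathcal{V}$ which minimize the total distance between similar elements and maximize the total distance between dissimilar elements. Let $P$ be a function which maps each element of $\mathcal{V}$ to the subset of $\mathcal{V}$ it is similar to. Analogously, let $N$ map each element of $\mathcal{V}$ to the subset of $\mathcal{V}$ it is dissimilar to. Pyke aims to minimize the following objective function:
$$J(\mathcal{V}) = \sum_{x \in \mathcal{V}} \sum_{y \in P(x)} d(\vec{x}, \vec{y}) - \sum_{x \in \mathcal{V}} \sum_{y \in N(x)} d(\vec{x}, \vec{y})$$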
4.2 Approach
Pyke implements the intuition described above as follows: Given an input KG $\mathcal{K}$, Pyke first constructs a symmetric similarity matrix $A$ of dimensions $|\mathcal{V}| \times |\mathcal{V}|$. We will use $a_{x,y}$ to denote the similarity coefficient between $x \in \mathcal{V}$ and $y \in \mathcal{V}$ stored in $A$. Pyke truncates this matrix to (1) reduce the effect of oversampling and (2) accelerate subsequent computations. The initial embeddings of all $x \in \mathcal{V}$ in $\mathbb{R}^n$ are then determined. Subsequently, Pyke uses the physical model described above to improve the embeddings iteratively. The iteration is run at most $T$ times or until the objective function stops decreasing. In the following, we explain each of the steps of the approach in detail. We use the RDF graph shown in Figure 1 as a running example (this example is provided in the DL-Learner framework at http://dl-learner.org).
4.2.1 Building the similarity matrix.
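For any two elements $x, y \in \mathcal{V}$, we set $a_{x,y} = \sigma(x, y) = PPMI(x, y)$ in our current implementation. We compute the probabilities $\Pr(x)$, $\Pr(y)$ and $\Pr(x, y)$ as follows:
$$\Pr(x) = \frac{|\{(s, p, o) \in \mathcal{K} : x \in \{s, p, o\}\}|}{|\mathcal{K}|}, \quad \Pr(y) = \frac{|\{(s, p, o) \in \mathcal{K} : y \in \{s, p, o\}\}|}{|\mathcal{K}|}, \quad \Pr(x, y) = \frac{|\{(s, p, o) \in \mathcal{K} : \{x, y\} \subseteq \{s, p, o\}\}|}{|\mathcal{K}|}$$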
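The construction of the similarity coefficients can be sketched as follows. The sketch assumes that the underlying event is "the term appears in a triple", so that probabilities are fractions of triples; it stores only positive coefficients in a dictionary instead of a dense matrix, and the function name `ppmi_similarities` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def ppmi_similarities(triples):
    """Return a sparse, symmetric map (x, y) -> PPMI based on triple co-occurrence."""
    total = len(triples)
    single = Counter()   # number of triples each term appears in
    joint = Counter()    # number of triples each unordered pair appears in
    for triple in triples:
        terms = sorted(set(triple))
        single.update(terms)
        joint.update(combinations(terms, 2))
    similarities = {}
    for (x, y), count in joint.items():
        ratio = (count / total) / ((single[x] / total) * (single[y] / total))
        score = math.log(ratio)
        if score > 0.0:  # truncate negative values to 0 by not storing them
            similarities[(x, y)] = similarities[(y, x)] = score
    return similarities


toy_graph = [("a", "p", "b"), ("a", "p", "c"), ("d", "q", "e")]
sim = ppmi_similarities(toy_graph)
```

Pairs that never share a triple, such as `("a", "d")` in the toy graph, are simply absent from the map, which corresponds to a coefficient of 0.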
4.2.2 Computing and .
To avoid oversampling positive or negative examples, we only use a portion of $A$ for the subsequent optimization of our objective function. For each $x \in \mathcal{V}$, we begin by computing $P(x)$ by selecting the $K$ resources which are most similar to $x$. Note that if fewer than $K$ resources have a non-zero similarity to $x$, then $P(x)$ contains exactly the set of resources with a non-zero similarity to $x$. Thereafter, we sample $K$ elements $y$ of $\mathcal{V}$ with $a_{x,y} = 0$ randomly. We call this set $N(x)$. For all $y \in N(x)$, we set $a_{x,y}$ to $-\omega$, where $\omega$ is our repulsive constant. The values of $a_{x,y}$ for $y \in P(x)$ are preserved. All other values are set to 0. After carrying out this process for all $x \in \mathcal{V}$, each row of $A$ now contains exactly $2K$ non-zero entries provided that each $x$ has at least $K$ resources with non-zero similarity. Given that $K \ll |\mathcal{V}|$, $A$ is now sparse and can be stored accordingly (we use $A$ for the sake of explanation; for practical applications, this step can be implemented using priority queues, which makes the quadratic space needed to store $A$ unnecessary). The PPMI similarity matrix for our example graph is shown in Figure 2.
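The selection of attracting and repulsive elements for a single vocabulary term can be sketched as below. The sketch assumes a sparse similarity map keyed by term pairs (missing pairs count as 0) and uses `k` for the threshold on positive and negative examples; `select_neighbours` is an illustrative name, not the authors' implementation.

```python
import heapq
import random


def select_neighbours(x, vocab, sim, k, rng):
    """Pick up to k most similar terms and up to k random zero-similarity terms."""
    scored = [(sim.get((x, y), 0.0), y) for y in vocab if y != x]
    # Attracting elements: the k highest-scoring terms with non-zero similarity.
    positives = [y for score, y in heapq.nlargest(k, scored) if score > 0.0]
    # Repulsive elements: a random sample among the terms with zero similarity.
    candidates = sorted(y for score, y in scored if score == 0.0)
    negatives = rng.sample(candidates, min(k, len(candidates)))
    return positives, negatives


vocab = ["a", "b", "c", "d", "e"]
sim = {("a", "b"): 2.0, ("a", "c"): 1.0}
pos_1, neg_1 = select_neighbours("a", vocab, sim, k=1, rng=random.Random(0))
pos_3, neg_3 = select_neighbours("a", vocab, sim, k=3, rng=random.Random(0))
```

With `k=3`, only two terms have a non-zero similarity to `"a"`, so the positive set contains exactly those two, mirroring the remark about terms with fewer than $K$ similar resources.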
4.2.3 Initializing the embeddings.
Each $x \in \mathcal{V}$ is mapped to a single point $\vec{x}_0$ of unit mass in $\mathbb{R}^n$ at iteration $t = 0$. As exploring sophisticated initialization techniques is out of the scope of this paper, the initial vector is set randomly (preliminary experiments suggest that applying a singular value decomposition on $A$ and initializing the embeddings with the latent representation of the elements of the vocabulary along the most salient eigenvectors has the potential of accelerating the convergence of our approach). Figure 3 shows a 3D projection of the initial embeddings for our running example (with ).
4.2.4 Iteration.
This is the crux of our approach. In each iteration $t$, our approach assumes that the elements of $P(x)$ attract $x$ with a total force
$$F_a(\vec{x}_t) = \sum_{y \in P(x)} \sigma(x, y) \times (\vec{y}_t - \vec{x}_t)$$
On the other hand, the elements of $N(x)$ repulse $x$ with a total force
$$F_r(\vec{x}_t) = -\sum_{y \in N(x)} \omega \, \frac{\vec{y}_t - \vec{x}_t}{d(\vec{x}_t, \vec{y}_t)^2}$$
We assume that exactly one unit of time elapses between two iterations. The embedding of $x$ at iteration $t + 1$ can now be calculated by displacing $\vec{x}_t$ proportionally to $F_a(\vec{x}_t) + F_r(\vec{x}_t)$. However, implementing this model directly leads to a chaotic (i.e., non-converging) behavior in most cases. We enforce the convergence using an approach borrowed from simulated annealing, i.e., we reduce the total energy of the system by a constant factor after each iteration. By these means, we can ensure that our approach always terminates, i.e., we can iterate until the objective function does not decrease significantly or until a maximal number of iterations is reached.
Algorithm 1 shows the pseudocode of our approach. Pyke updates the embeddings of vocabulary terms iteratively until one of the following two stopping criteria is satisfied: Either the upper bound on the iterations $T$ is met or a lower bound on the total change in the embeddings (i.e., $\epsilon$) is reached. A gradual reduction in the system energy inherently guarantees the termination of the process of learning embeddings. A 3D projection of the resulting embedding for our running example is shown in Figure 3.
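The iteration described above can be sketched with NumPy as follows. This is a simplified illustration and not Algorithm 1 itself: it assumes Euclidean distance, a multiplicative energy decay (the parameter `decay` is an assumption), synchronous updates of all points, and neighbour maps keyed by row index. The names `iterate`, `pos`, `neg`, `omega`, `energy`, `max_iter` and `eps` are chosen for this sketch.

```python
import numpy as np


def iterate(emb, pos, neg, omega=0.1, energy=1.0, decay=0.9, max_iter=50, eps=1e-3):
    """Spring-like attraction and inverse-distance repulsion with decaying energy.

    emb: (m, n) array of initial embeddings.
    pos: dict mapping a row index to a list of (neighbour index, similarity).
    neg: dict mapping a row index to a list of repulsive neighbour indices.
    """
    emb = np.asarray(emb, dtype=float).copy()
    for _ in range(max_iter):
        updated = emb.copy()
        for i in range(len(emb)):
            force = np.zeros(emb.shape[1])
            for j, weight in pos.get(i, []):
                force += weight * (emb[j] - emb[i])              # pull towards j
            for j in neg.get(i, []):
                diff = emb[j] - emb[i]
                force -= omega * diff / (diff @ diff + 1e-12)    # push away from j
            updated[i] = emb[i] + energy * force
        change = np.linalg.norm(updated - emb, axis=1).sum()
        emb = updated
        energy *= decay              # cool the system down after every iteration
        if change < eps:             # stop once the embeddings barely move
            break
    return emb


start = [[0.0, 0.0], [1.0, 0.0]]
attracted = iterate(start, pos={0: [(1, 0.25)], 1: [(0, 0.25)]}, neg={})
repelled = iterate(start, pos={}, neg={0: [1], 1: [0]})
```

In the two-point example, the attracted pair ends up closer than its initial distance of 1, while the repelled pair ends up farther apart; the decaying energy shrinks the step size in both cases.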
5 Complexity Analysis
5.1 Space complexity
Let $m = |\mathcal{V}|$. We would need at most $\frac{m(m-1)}{2}$ entries to store $A$, as the matrix is symmetric and we do not need to store its diagonal. However, there is actually no need to store $A$. We can implement $P(x)$ as a priority queue of size $K$ in which the indexes of the $K$ elements of $\mathcal{V}$ most similar to $x$ as well as their similarity to $x$ are stored. $N(x)$ can be implemented as a buffer of size $K$ which contains only indexes. Once $N(x)$ reaches its maximal size $K$, then new entries (i.e., $y$ with $PPMI(x, y) = 0$) are added randomly. Hence, we need $O(Km)$ space to store both $P$ and $N$. Note that $K \ll m$. The embeddings require exactly $2mn$ space as we store $\vec{x}_t$ and $\vec{x}_{t-1}$ for each $x \in \mathcal{V}$. The force vectors $F_a$ and $F_r$ each require a space of $n$. Hence, the space complexity of Pyke lies clearly in $O(mn + Km)$ and is hence linear w.r.t. the size of the input knowledge graph when the number of dimensions of the embeddings and the number of positive and negative examples are fixed.
5.2 Time complexity
Initializing the embeddings requires $mn$ operations. The initialization of $P$ and $N$ can also be carried out in linear time. Adding an element to $P(x)$ and $N(x)$ is carried out at most $m$ times. For each $x$, the addition of an element to $P(x)$ has a runtime of at most $K$. Adding elements to $N(x)$ is carried out in constant time, given that the addition is random. Hence the computation of $P(x)$ and $N(x)$ can be carried out in linear time w.r.t. $m$. This computation is carried out $m$ times, i.e., once for each $x$. Hence, the overall runtime of the initialization for Pyke is in $O(m^2)$. Importantly, the update of the position of each $x$ can be carried out in $O(K)$, leading to each iteration having a time complexity of $O(mK)$. The total runtime complexity for the iterations is hence $O(mKT)$, which is linear in $m$. This result is of central importance for our subsequent empirical results, as the iterations make up the bulk of Pyke’s runtime. Hence, Pyke’s runtime should be close to linear in real settings.
6 Evaluation
6.1 Experimental Setup
The goal of our evaluation was to compare the quality of the embeddings generated by Pyke with the state of the art. Given that there is no intrinsic measure for the quality of embeddings, we used two extrinsic evaluation scenarios. In the first scenario, we measured the type homogeneity of the embeddings generated by the KGE approaches we considered. We achieved this goal by using a scalable approximation of DBSCAN dubbed HDBSCAN [4]. In our second evaluation scenario, we compared the performance of Pyke on the type prediction task against that of six state-of-the-art algorithms. In both scenarios, we only considered embeddings of the subset $\mathcal{S}$ of $\mathcal{V}$ as done in previous works [10, 17]. We set , and throughout our experiments. The values were computed using a Sobol Sequence optimizer [16]. All experiments were carried out on a single core of a server running Ubuntu 18.04 with GB RAM with 16 Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz processors.
We used six datasets (2 real, 4 synthetic) throughout our experiments. An overview of the datasets used in our experiments is shown in Table 2. Drugbank (download.bio2rdf.org/#/release/4/drugbank) is a small-scale KG, whilst the DBpedia (version 2016-10) dataset is a large cross-domain dataset. Note that we compiled the DBpedia dataset by merging the dumps of mapping-based objects, skos categories and instance types provided in the DBpedia download folder for version 2016-10 at downloads.dbpedia.org/2016-10. The four synthetic datasets were generated using the LUBM generator [5] with 100, 200, 500 and 1000 universities.
We evaluated the homogeneity of embeddings by measuring the purity [9] of the clusters generated by HDBSCAN [4]. The original cluster purity equation assumes that each element of a cluster is mapped to exactly one class [9]. Given that a single resource can have several types in a knowledge graph (e.g., BarackObama is a person, a politician, an author and a president in DBpedia), we extended the cluster purity equation as follows: Let $C$ be the set of all classes found in $\mathcal{K}$. Each $x \in \mathcal{S}$ was mapped to a binary type vector $\theta(x)$ of length $|C|$. The $i$th entry of $\theta(x)$ was 1 iff $x$ was of type $c_i \in C$. In all other cases, the entry was set to 0. Based on these premises, we computed the purity of a clustering as follows:
where are the clusters computed by HDBSCAN. A high purity means that resources with similar type vectors (e.g., presidents who are also authors) are located close to each other in the embedding space, which is a desirable characteristic of a KGE.
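In our second evaluation, we performed a type prediction experiment in a manner akin to [10, 17]. For each resource $x \in \mathcal{S}$, we used the $k$ closest embeddings of $x$ to predict $x$’s type vector. We then compared the average of the types predicted with $x$’s known type vector $\theta(x)$ using the cosine similarity:
$$\text{prediction score} = \frac{1}{|\mathcal{S}|} \sum_{x \in \mathcal{S}} \cos\left(\theta(x), \sum_{y \in kNN(x)} \theta(y)\right)$$
where $kNN(x)$ stands for the $k$ nearest neighbors of $x$. We employed $k \in \{1, 3, 5, 10, 15, 30, 50, 100\}$ in our experiments.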
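The evaluation procedure described above can be sketched as follows. The sketch assumes Euclidean nearest neighbours computed by brute force and binary type vectors stored row-wise; the function name `type_prediction_score` is illustrative and the toy data are not taken from the experiments.

```python
import numpy as np


def type_prediction_score(embeddings, type_vectors, k):
    """Mean cosine similarity between known and kNN-averaged type vectors."""
    embeddings = np.asarray(embeddings, dtype=float)
    type_vectors = np.asarray(type_vectors, dtype=float)
    scores = []
    for i in range(len(embeddings)):
        distances = np.linalg.norm(embeddings - embeddings[i], axis=1)
        distances[i] = np.inf                      # a resource is not its own neighbour
        neighbours = np.argsort(distances)[:k]
        predicted = type_vectors[neighbours].mean(axis=0)
        norm = np.linalg.norm(predicted) * np.linalg.norm(type_vectors[i])
        scores.append(float(predicted @ type_vectors[i] / norm) if norm > 0 else 0.0)
    return float(np.mean(scores))


# Toy data: two tight groups of points in the plane.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
matching_types = [[1, 0], [1, 0], [0, 1], [0, 1]]      # neighbours share their type
conflicting_types = [[1, 0], [0, 1], [1, 0], [0, 1]]   # neighbours never share a type
good = type_prediction_score(points, matching_types, k=1)
bad = type_prediction_score(points, conflicting_types, k=1)
```

When nearby embeddings share their types the score is 1, and when every nearest neighbour has a disjoint type the score drops to 0.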
Preliminary experiments showed that performing the cluster purity and type prediction evaluations on embeddings of large knowledge graphs is impractical because of the long runtimes of the clustering algorithm. For instance, HDBSCAN did not terminate in 20 hours of computation when . Consequently, we had to apply HDBSCAN on embeddings on the subset of on DBpedia which contained resources of type Person or Settlement. The resulting subset of on DBpedia consists of RDF resources. For the type prediction task, we sampled resources from according to a random distribution and fixed them across the type prediction experiments for all KGE models.
6.2.1 Cluster Purity Results.
Table 3 displays the cluster purity results for all competing approaches. Pyke achieves a cluster purity of 0.75 on Drugbank and clearly outperforms all other approaches. DBpedia turned out to be a more difficult dataset. Still, Pyke was able to outperform all state-of-the-art approaches by between 11% and 26% (absolute) on Drugbank and between 9% and 23% (absolute) on DBpedia. Note that in 3 cases, the implementations available were unable to complete the computation of embeddings within 24 hours.
6.2.2 Type Prediction Results.
Figure 4 and Figure 5 show our type prediction results on the Drugbank and DBpedia datasets. Pyke outperforms all state-of-the-art approaches across all experiments. In particular, it achieves a margin of up to 22% (absolute) on Drugbank and 23% (absolute) on DBpedia. Like in the previous experiment, all KGE approaches perform worse on DBpedia, with prediction scores varying between and .
6.2.3 Runtime Results.
Table 5 shows the runtime performances of all models on the two real benchmark datasets, while Figure 6 displays the runtime of Pyke on the synthetic LUBM datasets. Our results support our original hypothesis. The low space and time complexities of Pyke mean that it runs efficiently: Our approach achieves runtimes of only 25 minutes on Drugbank and 309 minutes on DBpedia, while outperforming all other approaches by up to 14 hours in runtime.
In addition to evaluating the runtime of Pyke on real data, we were interested in determining its behaviour on datasets of growing sizes. We used LUBM datasets and computed a linear regression of the runtime using ordinary least squares (OLS). The runtime results for this experiment are shown in Figure 6. The linear fit shown in Table 4 achieves $R^2$ values beyond 0.99, which points to a clear linear fit between Pyke’s runtime and the size of the input dataset.
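A linear regression of this kind can be sketched with NumPy's least-squares polynomial fit. The numbers in the usage example are placeholders chosen to be exactly linear; they are not measurements from the LUBM experiments, and `linear_fit_r2` is a hypothetical helper name.

```python
import numpy as np


def linear_fit_r2(sizes, runtimes):
    """Fit runtime = slope * size + intercept by OLS and report R squared."""
    x = np.asarray(sizes, dtype=float)
    y = np.asarray(runtimes, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = float(residuals @ residuals)            # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())      # total variation
    return float(slope), float(intercept), 1.0 - ss_res / ss_tot


# Placeholder values only: a perfectly linear relation y = 2x + 1.
slope, intercept, r_squared = linear_fit_r2([1, 2, 3, 4], [3, 5, 7, 9])
```

An $R^2$ close to 1 indicates that a straight line explains almost all of the variation in the runtimes.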
|Approach|Drugbank|DBpedia|
|Pyke|25 ± 1|309 ± 1|
|TransE|68 ± 1|685 ± 1|
|CP|230 ± 1|1154 ± 1|
|DistMult|210 ± 1|1030 ± 1|
Table 5: Runtime performances (in minutes) of all competing approaches. All approaches were executed three times on each dataset. The reported results are the mean and standard deviation of the last two runs. The best results are marked in bold. Experiments marked with * did not terminate after 24 hours of computation.
We believe that the good performance of Pyke stems from (1) its sampling procedure and (2) its being akin to a physical simulation. Employing PPMI to quantify the similarity between resources seems to yield better sampling results than generating negative examples using the local closed world assumption that underlies the sampling procedures of all competing state-of-the-art KGE models. More importantly, positive and negative sampling occur in our approach per resource rather than per RDF triple. Therefore, Pyke is able to benefit more from negative and positive sampling. By virtue of being akin to a physical simulation, Pyke is able to run efficiently even when each resource is mapped to 45 attractive and 45 repulsive resources (see Table 5), whilst all state-of-the-art KGE approaches required more computation time.
7 Conclusion
We presented Pyke, a novel approach for the computation of embeddings on knowledge graphs. By virtue of being akin to a physical simulation, Pyke retains a linear space complexity. This was shown through a complexity analysis of our approach. While the time complexity of the approach is quadratic due to the computation of $P$ and $N$, all other steps are linear in their runtime complexity. Hence, we expected our approach to behave close to linearly. Our evaluation on LUBM datasets suggests that this is indeed the case and the runtime of our approach grows close to linearly. This is an important result, as it means that our approach can be used on very large knowledge graphs and return results faster than popular algorithms such as Word2Vec and TransE. However, time efficiency is not all. Our results suggest that Pyke outperforms state-of-the-art approaches in the two tasks of type prediction and clustering. Still, there is clearly a lack of normalized evaluation scenarios for knowledge graph embedding approaches. We shall hence develop such benchmarks in future work. Our results open a plethora of other research avenues. First, the current approach to computing similarity between entities/relations on KGs is based on local similarity. Exploring other similarity measures will be at the center of future work. In addition, using a better initialization for the embeddings should lead to faster convergence. Finally, one could use a stochastic approach (in the same vein as stochastic gradient descent) to further improve the runtime of Pyke.
References
[1] (2014) A semantic matching energy function for learning with multi-relational data. Machine Learning.
[2] (2013) Translating embeddings for modeling multi-relational data.
[3] Learning structured embeddings of knowledge bases. In Twenty-Fifth AAAI Conference on Artificial Intelligence.
[4] Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining.
[5] (2005) LUBM: a benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web 3 (2-3), pp. 158–182.
[6] (1927) The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics 6 (1-4), pp. 164–189.
[7] (2019) Knowledge graph embedding based question answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining.
[8] (2015) Learning entity and relation embeddings for knowledge graph completion. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[9] (2010) Introduction to information retrieval. Natural Language Engineering.
[10] (2016) Type prediction in RDF knowledge bases using hierarchical multilabel classification. In Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics, pp. 14.
[11] (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.
[12] Holographic embeddings of knowledge graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pp. 1955–1961.
[13] (2011) A three-way model for collective learning on multi-relational data. In ICML, Vol. 11.
[14] (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[15] (2016) RDF2Vec: RDF graph embeddings for data mining. In International Semantic Web Conference.
[16] (2010) Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Computer Physics Communications 181 (2), pp. 259–270.
[17] (2017) Towards holistic concept representations: embedding relational knowledge, visual attributes, and distributional word semantics. In International Semantic Web Conference.
[18] (2016) Complex embeddings for simple link prediction. In International Conference on Machine Learning.
[19] (2017) Knowledge graph embedding: a survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering.
[20] (2017) Community preserving network embedding. In AAAI.
[21] Representation learning of knowledge graphs with entity descriptions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pp. 2659–2665.
[22] (2014) Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.