A knowledge graph (KG) stores factual information in the form of triples. Today, many such graphs exist for various domains, are publicly available, and are being interlinked. As of 2019, the linked open data cloud (https://lod-cloud.net/) counts more than 1,000 data sets with multiple billions of unique triples. Knowledge graphs are typically consumed using factual queries for downstream tasks such as question answering. Recently, knowledge graph embedding models have been explored as a new way of knowledge graph exploitation. KG embeddings
(KGEs) represent nodes and (depending on the approach) also edges as continuous vectors. One such approach is RDF2Vec. It has been used and evaluated for machine learning, entity and document modeling, and for recommender systems. RDF2Vec vectors trained on a large knowledge graph have also been used as a background knowledge source for ontology matching.
While it has been shown that KGEs are helpful in many applications, embeddings on larger knowledge graphs can be expensive to train and to use for downstream applications. Since most downstream applications only require embedding vectors for a small subset of all concepts, computing a complete embedding model or downloading a complete pre-computed one is often not desirable. kgvec2go.org therefore allows concept embeddings to be easily accessed and consumed through simple Web APIs.
With KGvec2go, rather than having to download the complete embedding model, a Web query can be used to obtain only the desired concept in vector representation or even a derived statistic such as the similarity between two concepts. This facilitates downstream applications on less powerful devices, such as smartphones, as well as the application of knowledge graph embeddings in machine learning scenarios where the data scientists do not want to train the models themselves or do not have the means to perform the computations.
Models for four knowledge graphs were learned, namely: DBpedia, WebIsALOD, Wiktionary, and WordNet.
The data set presented here allows comparing the performance of different knowledge graph embeddings on different application tasks. It further allows combining embeddings from different knowledge graphs in downstream applications. We evaluated the embeddings on three semantic gold standards and also explored the combination of embeddings.
This paper is structured as follows: In the next section, related work will be presented. Section 3 outlines the approach, Section 4 presents the data sets for which an embedding has been trained, Section 5 introduces the Web API that is provided to consume the learned embedding models, and Section 6 evaluates the models on three semantic gold standards. The paper closes with a summary and an outlook on future work.
2 Related Work
For data mining applications, propositional feature vectors are required, i.e., vectors with either binary, nominal, or numerical elements. An RDF knowledge graph does not come with such properties and has to be translated into feature vectors if it shall be exploited in data mining applications. This process is known as propositionalization [17, 30]. Two basic approaches for knowledge graph propositionalization can be distinguished: (i) supervised propositionalization, where the user has to manually craft features, such as multiple ASK queries for nodes of interest, and (ii) unsupervised approaches, where the user does not have to know the structure of the graph.
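The supervised variant can be illustrated with a small sketch in which each hand-crafted ASK query becomes one binary feature. The toy triples, the `ask` helper, and the feature queries below are made up for illustration and are not part of any real data set:

```python
# Sketch of supervised propositionalization: each manually crafted ASK
# query contributes one binary element to the feature vector.
toy_graph = {
    ("Mannheim", "type", "City"),
    ("Mannheim", "country", "Germany"),
}

def ask(subject, predicate, obj):
    """Mimics a SPARQL ASK query against the toy triple set."""
    return (subject, predicate, obj) in toy_graph

# One hand-crafted (predicate, object) test per feature of interest.
feature_queries = [("type", "City"), ("country", "Germany"), ("type", "Person")]

def propositionalize(entity):
    """Binary feature vector for an entity, one element per ASK query."""
    return [1 if ask(entity, p, o) else 0 for p, o in feature_queries]

print(propositionalize("Mannheim"))  # [1, 1, 0]
```

The unsupervised approaches discussed next avoid exactly this manual query engineering.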
In order to exploit knowledge graphs in data mining applications, embedding models have gained traction over the last years. Wang et al. distinguish two families of approaches: distance-based and semantic matching-based approaches. The best known representatives of the first family are translation-based approaches. Given a set of entities E and a set of relations R as well as triples in the form (h, r, t), where h, t ∈ E and r ∈ R, TransE trains vectors with the learning objective h + r ≈ t given that (h, r, t) holds. Many similar approaches based on TransE have been proposed, such as TransH or TransA. In the second family, the most well known approaches are RESCAL, DistMult, and HolE.
Another group of approaches exploits language models, such as node2vec and RDF2Vec. This work is based on the latter algorithm.
Given a (knowledge) graph G = (V, E), where V is the set of vertices and E is the set of directed edges, the RDF2Vec approach generates multiple sentences per vertex v ∈ V. An RDF2Vec sentence resembles a walk through the graph starting at a specified vertex v. Datatype properties are excluded from the walk generation. After the sentence generation, the word2vec algorithm [21, 22] is applied to train a vector representation for each element v ∈ V and e ∈ E. word2vec is a neural language model. Given the context C(w) of a word w, where C(w) is a set of preceding and succeeding words of w, the learning objective of word2vec is to predict w. This is known as the continuous bag of words model (CBOW). The skip-gram (SG) model is trained the other way around: given w, C(w) has to be predicted. Within this training process, a parameter defines the size of C(w) and is also known as window or window size.
RDF2Vec is different from a pure language model in that it uses a knowledge graph as training corpus. Knowledge graphs are typically more structured than human language and can contain named entities that do not have to be explicitly detected.
While there is an ever-growing number of knowledge graph embeddings, few works have addressed the software infrastructure aspect so far. The OpenKE toolkit provides a unified framework for efficiently training KGEs, but does not address lightweight exploitation. The closest project to our work is Wembedder, which, however, only serves embeddings for one single KG, i.e., Wikidata. This makes KGvec2go the first resource serving multiple embedding models simultaneously.
3 Approach
For this work, the RDF2Vec approach has been re-implemented in Java and Python with a more efficient walk generation process. The implementation of the walk generator is publicly available on GitHub (https://github.com/janothan/kgvec2go-walks/).
For the sentence generation, duplicate-free random walks with depth = 8 have been generated, whereby edges within the sentences are also counted. For WordNet and Wiktionary, 500 walks have been calculated per entity. For WebIsALOD and DBpedia, 100 walks have been created in order to account for the comparatively large size of these knowledge graphs.
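The walk generation can be sketched as follows; the adjacency-list graph representation and the exact depth accounting are assumptions of this sketch, the actual implementation is in the linked repository:

```python
import random

def generate_walks(graph, start, n_walks, depth, seed=42):
    """Duplicate-free random walks starting at `start`. Edge labels are
    part of the sentence, so each step appends a predicate and an object
    node. `graph` maps a node to a list of (predicate, object) pairs."""
    rng = random.Random(seed)
    walks = set()  # a set keeps the collected walks duplicate-free
    for _ in range(n_walks):
        walk, node = [start], start
        for _ in range(depth):
            out_edges = graph.get(node)
            if not out_edges:
                break  # dead end: stop this walk early
            predicate, node = rng.choice(out_edges)
            walk += [predicate, node]
        walks.add(tuple(walk))
    return walks

# Tiny illustrative graph (made up):
toy = {"Mannheim": [("country", "Germany")], "Germany": [("type", "Country")]}
walks = generate_walks(toy, "Mannheim", n_walks=10, depth=2)
```

Since the toy graph is deterministic, all ten generated walks collapse into a single duplicate-free sentence.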
The models were trained with the following configuration: skip-gram vectors, window size = 5, number of iterations = 5, negative sampling for optimization, negative samples = 25. Apart from walk-generation adaptations due to the size of the knowledge graphs, the configuration parameters to train the models have been held constant and no data set specific optimizations have been performed in order to allow for comparability.
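In gensim terms, this configuration corresponds to the parameter set below; the parameter names follow the gensim 4 API (older versions use size/iter instead of vector_size/epochs), the 200-dimensional vector size is the one served by the API (Section 5.1), and min_count is an assumption:

```python
# word2vec training configuration shared across all four models.
params = dict(
    sg=1,             # skip-gram rather than CBOW
    window=5,         # window size = 5
    epochs=5,         # number of iterations = 5
    negative=25,      # negative sampling with 25 negative samples
    vector_size=200,  # dimensionality of the served vectors
    min_count=1,      # keep every walk token (assumption of this sketch)
)
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=walks, **params)
```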
In addition, a Web API is provided to access the data models in a lightweight way. This allows for easy access to embedding models and brings powerful embedding models to devices with restrictions in CPU and RAM, such as smartphones. The APIs are introduced in Section 5. The server has been implemented in Python using flask (https://flask.palletsprojects.com/en/1.1.x/) and gensim and can be run using the Apache HTTP Server. Its code is publicly available on GitHub (https://github.com/janothan/kgvec2go-server/).
4 The Data Sets
For this work, four data sets have been embedded, which are briefly introduced in the following.
Wiktionary is ”[a] collaborative project run by the Wikimedia Foundation to produce a free and complete dictionary in every language” (https://web.archive.org/web/20190806080601/https://en.wiktionary.org/wiki/Wiktionary). The project is organized similarly to Wikipedia: everybody can contribute and edit the dictionary. The content is reviewed in a community process. Like Wikipedia, Wiktionary is available in many languages. DBnary is an RDF version of Wiktionary that is publicly available (http://kaiko.getalp.org/about-dbnary/download/). The DBnary data set makes use of an extended LEMON model to describe the data. For this work, a recent download from July 2019 of the English Wiktionary has been used.
DBpedia is a well-known linked data set created by extracting structured knowledge from Wikipedia and other Wikimedia projects. The data is publicly available. For this work, the 2016-10 download has been used (https://wiki.dbpedia.org/downloads-2016-10). Compared to the other knowledge graphs exploited here, DBpedia contains mainly instances, such as the industrial rock band Nine Inch Nails (which cannot be found in WordNet or Wiktionary). With its instance data, DBpedia is therefore complementary to the other, lemma-focused knowledge graphs.
The WebIsA database is a data set which consists of hypernymy relations extracted from the Common Crawl (https://commoncrawl.org/), a downloadable copy of the Web. The extraction was performed in an automatic manner through Hearst-like lexico-syntactic patterns. For example, from the sentence ”[…] added that the country has favourable economic agreements with major economic powers, including the European Union.”, the fact isA(european_union, major_economic_power) is extracted (this is a real example, see http://webisa.webdatacommons.org/417880315).
WebIsALOD is the Linked Open Data endpoint which allows querying the data in SPARQL (http://webisa.webdatacommons.org/). In addition to the endpoint, machine learning was used to assign confidence scores to the extracted triples. The data set of the endpoint is filtered, i.e., it contains a subset of the original WebIsA database, to ensure a higher data quality. The knowledge graph contains instances (like DBpedia) as well as more abstract concepts that can also be found in a dictionary.
WordNet is a well-known and heavily used database of English words that are grouped into sets which represent one particular meaning, so-called synsets. The resource is strictly authored. WordNet is publicly available, included in many natural language processing frameworks, and often used in research. An RDF version is also available for download and was used for this work (http://wordnet-rdf.princeton.edu/about/).
5 Web API
KGvec2go offers a simple Web API to retrieve: (i) individual vectors for concepts in different data sets, (ii) the cosine similarity between concepts directly, and (iii) the topmost related concepts for any given concept. Alternatively, the full models can be downloaded from the Web site directly (http://www.kgvec2go.org/download.html). The API is accessed through HTTP GET calls and provides answers in the form of a JSON string. This allows for simple usage on any device that has Internet access. In addition, natural words can be used to access the data rather than long URIs that follow their own idiosyncratic pattern, as is common for RDF2Vec embedding models. In the following, we briefly describe the services that are offered. For a full description of the services as well as a graphical user interface to explore the embeddings, we refer to the Web page kgvec2go.org.
5.1 Get Vector
kgvec2go.org allows downloading an individual vector, i.e., a 200-dimensional floating point array representation of a concept in a particular data set. The HTTP GET call takes two path parameters: data_set refers to the data set that shall be used (i.e., one of alod, dbpedia, wiktionary, wordnet) and concept_name to the natural language identifier of the concept (e.g., bed). This call can be used in machine learning scenarios, for instance, where a numerical representation of a concept is required.
For data sets that learn an embedding based on the part-of-speech (POS) of the term, such as WordNet, multiple vectors are returned for one key word if the word is available in multiple POS, such as laugh, which occurs as a noun and as a verb.
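A client-side sketch of such a call is shown below; the base path, the URL pattern, and the response shape are assumptions of this sketch rather than the documented API, so kgvec2go.org remains the authoritative description:

```python
import json
from urllib.parse import quote

BASE = "http://kgvec2go.org/rest"  # hypothetical base path

def vector_url(data_set, concept_name):
    """Build a hypothetical GET URL for the get-vector service."""
    return f"{BASE}/get-vector/{data_set}/{quote(concept_name)}"

# The service answers with a JSON string; a plausible (assumed) shape
# for a single-POS concept, truncated to three dimensions:
response_text = '{"vector": [0.12, -0.03, 0.47]}'
vector = json.loads(response_text)["vector"]
```

An actual request would issue an HTTP GET against the built URL and parse the JSON body as shown.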
5.2 Get Similarity
Given two concepts, kgvec2go.org allows querying a specified data set for the similarity score s ∈ [-1, 1], where 1 refers to perfect similarity.
The HTTP GET call takes three path parameters: data_set refers to the data set that shall be used and the two concept names refer to the concept labels for which the similarity shall be calculated. This call can be used wherever the similarity or relatedness of two concepts needs to be judged, such as in recommender systems or matching tasks. A Web UI is available to try out this call in a Web browser (http://www.kgvec2go.org/query.html). A screenshot is shown in Figure 1 for the terms France and Europe for the model learned on WebIsALOD.
5.3 Get Closest Concepts
The API is also capable of determining the closest concepts given a concept and a data set. The given concept is mapped to the vector space and compared with all other vectors. The call is therefore expensive on large data sets and is best suited for exploring a data set.
The HTTP GET call takes three path parameters: data_set refers to the data set that shall be used, top_n refers to the number of closest concepts that shall be obtained, and concept_name refers to the written representation of the concept. For data sets that learn an embedding based on the part-of-speech of the term, such as WordNet, the closest concepts are determined for all POS of the term and their scores are summarized. This allows calculating the closest concepts for a single term, such as sleep, that occurs in multiple POS (in this case as noun and as verb).
A Web UI is available to try out this call in a Web browser (http://www.kgvec2go.org/query.html). A screenshot is shown in Figure 2 for the term Germany on the trained DBpedia model.
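Why the call is expensive becomes clear from a brute-force sketch: the query vector is compared against every vector in the model, so the cost grows linearly with the vocabulary size. The two-dimensional toy vectors below are made up for illustration:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def closest_concepts(concept, vectors, top_n):
    """Brute-force nearest neighbours: one cosine per vocabulary entry."""
    query = vectors[concept]
    scored = [(c, cosine(query, v)) for c, v in vectors.items() if c != concept]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

toy_vectors = {  # made-up 2-d vectors for illustration
    "Germany": [1.0, 0.1],
    "France": [0.9, 0.2],
    "sleep": [0.0, 1.0],
}
print(closest_concepts("Germany", toy_vectors, top_n=1))
```

Real embedding servers typically speed this up with approximate nearest-neighbour indices, which is why the exact scan here is flagged as exploratory.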
6 Evaluation
6.1 Evaluation Gold Standards
In order to test whether there is semantic value in the trained vectors, we evaluate them on three data sets: WordSim-353, SimLex-999, and MEN. The principle of evaluation is the same for all gold standards used: the system is presented with two words and has to determine their relatedness or similarity; then, the rank correlation (also known as Spearman’s rho) with the scores in the gold standards is calculated. Higher correlations between the gold standards’ scores and the system’s scores are regarded as better. Pairs with an out-of-vocabulary term are handled here by returning a similarity of 0. As the goal of this data set is to provide comparable general-purpose embeddings, it is important to note that the embeddings were not specifically trained to perform well on the given tasks. On similarity tasks, for instance, the results would likely improve if antonymy relations were dropped. With other configuration settings, it is also possible to improve the results further on the given evaluation sets; this has, for instance, been done in earlier work, where better relatedness/similarity results on WebIsALOD could be achieved with other RDF2Vec configurations.
6.2 Evaluation Mode
The learned models were evaluated on their own on each of the evaluation data sets. In addition, a combination of all data sets was evaluated. For this, the individual similarity scores were added. Hence, s(a, b) = Σ_i s_i(a, b), where s(a, b) is the final similarity score assigned to the concept pair (a, b) and s_i(a, b) describes the individual score of a model trained on a single data set for the same concept pair. This can be done without normalization because (i) all scores are in the same value range ([-1, 1]), (ii) out-of-vocabulary terms receive a score of 0 (so they do not influence the final results), and (iii) Spearman’s rank correlation is used, which is independent of the absolute values – only the rank is considered.
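The combination and the rank correlation can be sketched as follows; the Spearman implementation below assumes tie-free data (it is then the Pearson correlation of the ranks), whereas a real evaluation would also handle ties:

```python
def combined_score(pair, model_scores):
    """s(a, b) = sum of per-model scores; OOV pairs contribute 0."""
    return sum(scores.get(pair, 0.0) for scores in model_scores)

def spearman_rho(xs, ys):
    """Spearman's rho for tie-free data: Pearson correlation of ranks."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        result = [0] * len(values)
        for rank, index in enumerate(order):
            result[index] = rank
        return result
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # identical for rx and ry
    return cov / var

print(spearman_rho([0.1, 0.5, 0.9], [1, 2, 3]))  # 1.0
```

Because only ranks enter the correlation, any strictly monotone rescaling of the summed scores would leave the evaluation result unchanged, which is exactly why no normalization is needed.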
6.3 Evaluation Results
The rank correlations on the three gold standards are summarized in Table 1. It can be seen that the results vary depending on the gold standard used. The Wiktionary data set performs best when it comes to relatedness. The WebIsALOD data set performs similarly well on WS-353 and performs best on MEN. On the SimLex-999 gold standard, WordNet outperforms the other data sets. The performance of DBpedia is significantly worse, which is due to many out-of-vocabulary terms: this particular data set is focused on instance data rather than lexical forms such as angry. The evaluation performed here is, therefore, not optimal for this data set. This can also be observed in the example results depicted in Table 2: while DBpedia and WebIsALOD work well for entities such as Germany, Wiktionary performs better for general words such as loud.
Interestingly, the combined evaluation mode outlined in subsection 6.2 is able to outperform the best individual results on WS-353 as well as on MEN (see Table 1). On SimLex-999, the combination of all similarity scores is very close to the best individual score (WordNet). This shows that it can be beneficial to combine several embedding spaces built on different data sets.
It is important to note that the vectors were not trained for the specific task at hand. Nonetheless, the combined embeddings perform well on WS-353, although the top-performing systems for each data set cannot be outperformed. The lower performance on SimLex-999 shows that relatedness is better represented in the embedding spaces than actual similarity. This is an intuitive result given that there was no training objective towards similarity.
When looking at the different properties of the knowledge graphs, it can be reasoned that the level of authoring is not important for the performance on the tasks at hand: WebIsALOD embeddings, which are derived from an automatically generated knowledge graph, easily outperform WordNet embeddings, which are derived from a highly authored knowledge base, on WS-353 and MEN.
6.4 Further Remarks
It is also possible to find typical analogies in the data. In this case, two concepts are presented to the model together with a third one for which the system shall determine an analogous concept. In the following examples, the last concept in each line is the one the system determined given the other three.
For example, on Wiktionary:
girl is to boy like man is to woman
big is to small like fake is to original
beautiful is to attractive like quick is to rapid
Similar results can be found on instance level. For example, on DBpedia:
Germany is to Angela Merkel like France is to François Hollande (note that François Hollande was indeed the president of France as of 2016)
(Table 2: excerpts of the closest concepts returned for the terms Germany and loud by the four trained models.)
7 Summary and Future Work
In this paper, we presented KGvec2go, a resource consisting of trained embedding models on four knowledge graphs. The models were evaluated on three different gold standards. It could be shown that the trained vectors carry semantic meaning and that a combination of different knowledge graph embeddings can be beneficial in some tasks. Furthermore, a lightweight API was presented which allows consuming the models in a computationally cheap, memory-efficient, and easy way through Web APIs. We are confident that our work eases the usage of knowledge graph embeddings in real-world applications.
For the future, we plan to extend the data set by adding further embedding models of the knowledge graphs presented, as well as including other knowledge graphs, and to extend the capabilities of the current API. Furthermore, we plan to exploit the trained models for downstream application tasks that profit from the inclusion of background knowledge, such as ontology matching and domain-specific data integration tasks.
8 Bibliographical References
- Y. Bengio and Y. LeCun (Eds.) (2013) 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings.
- (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26, Lake Tahoe, Nevada, USA, pp. 2787–2795.
- C. E. Brodley and P. Stone (Eds.) (2014) Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada. AAAI Press.
- (2012) Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 136–145.
- C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger (Eds.) (2013) Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, USA.
- N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S. Piperidis (Eds.) (2016) Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, May 23-28, 2016. European Language Resources Association (ELRA).
- C. d’Amato, M. Fernández, V. A. M. Tamma, F. Lécué, P. Cudré-Mauroux, J. F. Sequeda, C. Lange, and J. Heflin (Eds.) (2017) The Semantic Web - ISWC 2017 - 16th International Semantic Web Conference, Vienna, Austria, October 21-25, 2017, Proceedings, Part II. Lecture Notes in Computer Science, Vol. 10588, Springer.
- C. Fellbaum (Ed.) (1998) WordNet: An Electronic Lexical Database. Language, Speech, and Communication, MIT Press, Cambridge, Massachusetts.
- (2002) Placing search in context: the concept revisited. ACM Transactions on Information Systems 20(1), pp. 116–131.
- P. T. Groth, E. Simperl, A. J. G. Gray, M. Sabou, M. Krötzsch, F. Lécué, F. Flöck, and Y. Gil (Eds.) (2016) The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part I. Lecture Notes in Computer Science, Vol. 9981.
- (2016) node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, pp. 855–864.
- (2018) OpenKE: An open toolkit for knowledge embedding. In Proceedings of EMNLP.
- (2017) WebIsALOD: Providing hypernymy relations extracted from the web as linked open data. In The Semantic Web - ISWC 2017, Part II, pp. 111–119.
- (2015) SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4), pp. 665–695.
- (2016) Locally adaptive translation for knowledge graph embedding. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, pp. 992–998.
- (2001) Propositionalization approaches to relational data mining. In Relational Data Mining, S. Džeroski and N. Lavrač (Eds.), pp. 262–291.
- B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal, D. Shen, and R. Rastogi (Eds.) (2016) Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. ACM.
- (2015) DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2), pp. 167–195.
- (2012) Interchanging lexical resources on the Semantic Web. Language Resources and Evaluation 46(4), pp. 701–719.
- (2013) Efficient estimation of word representations in vector space. In ICLR 2013, Workshop Track Proceedings.
- (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119.
- (2016) Holographic embeddings of knowledge graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.
- (2011) A three-way model for collective learning on multi-relational data. In Proceedings of ICML, Vol. 11, pp. 809–816.
- (2017) Wembedder: Wikidata entity embedding web service. CoRR abs/1710.04099.
- (2012) Unsupervised generation of data mining features from linked open data. In Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, p. 31.
- (2018) ALOD2Vec matcher. In Proceedings of the 13th International Workshop on Ontology Matching (OM@ISWC 2018), Monterey, CA, USA, pp. 132–137.
- (2018) Automatic schema matching utilizing hypernymy relations extracted from the web. Mannheim.
- (2010) Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50.
- (2014) A comparison of propositionalization strategies for creating features from linked open data. Linked Data for Knowledge Discovery 6.
- (2016) RDF2Vec: RDF graph embeddings for data mining. In The Semantic Web - ISWC 2016, Part I, pp. 498–514.
- (2019) RDF2Vec: RDF graph embeddings and their applications. Semantic Web 10(4), pp. 721–752.
- (2014) Adoption of the linked data best practices in different topical domains. In International Semantic Web Conference, pp. 245–260.
- D. Schuurmans and M. P. Wellman (Eds.) (2016) Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. AAAI Press.
- (2016) A large database of hypernymy relations extracted from the web. In Proceedings of LREC 2016, Portorož, Slovenia.
- (2015) DBnary: Wiktionary as a lemon-based multilingual lexical resource in RDF. Semantic Web 6(4), pp. 355–361.
- P. Shvaiko, J. Euzenat, E. Jiménez-Ruiz, M. Cheatham, and O. Hassanzadeh (Eds.) (2018) Proceedings of the 13th International Workshop on Ontology Matching (OM@ISWC 2018), Monterey, CA, USA, October 8, 2018. CEUR Workshop Proceedings, Vol. 2288, CEUR-WS.org.
- (2012) The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14, 2012, Jeju Island, Korea - Volume 1: Long Papers. The Association for Computational Linguistics.
- (2017) Knowledge graph embedding: a survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29(12), pp. 2724–2743.
- (2014) Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec City, Québec, Canada, pp. 1112–1119.
- (2014) Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.