1 Motivation
Relational learning has been a hot topic for a couple of years. The most widespread datasets used for benchmarking new methods ([1], [2], [3], [4], [5], [6]) are a subset of WordNet (WN18) and two subsets of Freebase (FB15k, FB1M). Recently, WikiData has been used in the literature, as a whole in [7] or as subsets (about people and films) in [8, 9]. It is a diverse, high-quality dataset that could be useful to many researchers. To make it easier to use, we built thematic subsets, which are publicly available (https://graphs.telecom-paristech.fr/). They are presented in this paper.
2 Overview of the datasets
A knowledge graph is given by a set of nodes (entities) and a set of facts linking those nodes. A fact is a triplet of the form (head, relation, tail) linking two nodes by a typed edge (relation). WikiData is a large knowledge graph, archives of which can be downloaded.
We propose five topic-related datasets, which are subgraphs of the WikiData knowledge graph: animal species, companies, countries, films and humans. Each graph includes only nodes that are instances of its topic. For example, the humans dataset contains the node George Washington because the fact (George Washington, isInstanceOf, human) is true in WikiData.
More precisely, a node is included if it is an instance of the topic or of any sub-class of the topic. For example, countries contains USSR even though USSR is not an instance of country: it is an instance of historical country, which is a sub-class of country. Topics and examples of their sub-classes are reported in Table 1.
Each graph includes as edges only the relations that hold in WikiData. For example, the humans dataset contains the nodes George Washington and Martha Washington, and there is an edge between the two because the fact (George Washington, spouse, Martha Washington) is true.
Finally, WikiData facts linking the selected nodes to WikiData entities that were not selected (because they are not instances of the topic) are kept as attributes of the nodes. Note that since all the edges are labeled, each graph is itself a knowledge graph.
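The selection rules of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the datasets: the `Fact` type and the function names `select_nodes` and `split_facts` are hypothetical, and the set of topic classes is assumed to already contain the topic and all of its sub-classes.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Fact(NamedTuple):
    """A (head, relation, tail) triplet; hypothetical helper type."""
    head: str
    relation: str
    tail: str


def select_nodes(facts: Iterable[Fact], topic_classes: Set[str]) -> Set[str]:
    """Nodes are the entities that are instances of the topic or of a sub-class."""
    return {
        f.head
        for f in facts
        if f.relation == "isInstanceOf" and f.tail in topic_classes
    }


def split_facts(
    facts: Iterable[Fact], nodes: Set[str]
) -> Tuple[List[Fact], List[Fact]]:
    """Split the facts whose head is a selected node into edges and attributes."""
    edges, attributes = [], []
    for f in facts:
        if f.head not in nodes:
            continue  # facts about unselected entities are dropped
        if f.tail in nodes:
            edges.append(f)  # both ends are selected: an edge of the graph
        else:
            attributes.append(f)  # tail is outside the topic: an attribute
    return edges, attributes
```

With this sketch, the fact (George Washington, spouse, Martha Washington) ends up in the edges of humans, while (George Washington, isInstanceOf, human) is kept as an attribute because human is not itself a selected node.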
3 Presentation of the code
The proposed datasets were built using the WikiDataSets package (https://pypi.org/project/wikidatasets/). Three steps are necessary.
- Call get_subclasses on the WikiData ID of the topic entity (e.g. Q5 for humans) to fetch the WikiData entities that are sub-classes of the topic entity (e.g. sub-classes of human).
- Call query_wikidata_dump to read each line of the WikiData archive dump latest-all.json.bz2 (https://dumps.wikimedia.org/wikidatawiki/entities/) and keep only the lines corresponding to selected nodes. This returns a list of facts stored as pickle files. Labels of the entities and relations are collected along the way, in English when available, so that the labels of the attributes can be provided.
- Finally, call build_dataset to turn this list of facts into the five files presented in Table 4 (facts between nodes, attributes, an entity dictionary giving the label and WikiData ID of each entity, and a relation dictionary giving the label and WikiData ID of each relation).
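To give an idea of what the second step involves, here is a minimal sketch of a line-by-line filter over the dump. It does not call the WikiDataSets package and makes no claim about its internals: the function names are hypothetical, and the sketch only relies on the public layout of the WikiData JSON dump (one entity per line inside a JSON array, with "instance of" stored under property P31). It selects entities but does not extract their facts.

```python
import bz2
import json
from typing import Dict, Iterable, Iterator, Optional, Set

INSTANCE_OF = "P31"  # WikiData property ID for "instance of"


def iter_entities(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one entity dict per line of a WikiData JSON dump.

    The dump is one large JSON array with one entity per line, so the
    enclosing brackets and the trailing commas are stripped before parsing.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def instance_classes(entity: Dict) -> Set[str]:
    """Return the IDs of the classes the entity is an instance of."""
    classes = set()
    for claim in entity.get("claims", {}).get(INSTANCE_OF, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":  # skip "novalue" / "somevalue"
            classes.add(snak["datavalue"]["value"]["id"])
    return classes


def english_label(entity: Dict) -> Optional[str]:
    """Return the English label of the entity, or None if it has none."""
    return entity.get("labels", {}).get("en", {}).get("value")


def select_entities(lines: Iterable[str], topic_classes: Set[str]) -> Iterator[Dict]:
    """Keep the entities that are instances of at least one topic class."""
    for entity in iter_entities(lines):
        if instance_classes(entity) & topic_classes:
            yield entity


def select_from_dump(path: str, topic_classes: Set[str]) -> Iterator[Dict]:
    """Stream a compressed dump from disk without loading it in memory."""
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        yield from select_entities(dump, topic_classes)
```

Here `topic_classes` stands for the topic ID together with the sub-class IDs fetched in the first step, for instance a set containing Q5 for humans. Streaming the archive keeps memory usage flat, which matters given the size of the full dump.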
4 Use of the datasets
4.1 Community Detection
Using the Scikit-Network framework (https://pypi.org/project/scikit-network/), we extracted communities from the humans dataset with the Louvain algorithm [10]. To visualize the communities, the 50 nodes of highest degree were then extracted along with their neighbors. A snapshot of the visualization is presented in Figure 1, and the interactive version is available at https://graphs.telecom-paristech.fr/human_graph.html. It was created using the 3d-force-graph library (https://github.com/vasturiano/3d-force-graph).
Navigating through the graph, we find communities that seem to make sense (e.g. American artists, Vietnamese political leaders). We also find (in pink in Figure 1) the Chinese Tang dynasty, each small ball corresponding to an emperor with his wife and children.
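The extraction of the highest-degree nodes and their neighbors, used above for the visualization, could look like the following sketch. It is written with the standard library only and is not the authors' code: the function names are hypothetical, the graph is assumed to be treated as undirected when computing degrees, and the Louvain step itself is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (head, tail), relation label dropped


def build_adjacency(edges: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Undirected adjacency sets; self-loops are ignored."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for head, tail in edges:
        if head != tail:
            adjacency[head].add(tail)
            adjacency[tail].add(head)
    return adjacency


def top_degree_subgraph(
    edges: List[Edge], k: int = 50
) -> Tuple[List[str], Set[str], List[Edge]]:
    """Return the k highest-degree nodes, the kept nodes and the kept edges."""
    adjacency = build_adjacency(edges)
    # Sort by decreasing degree; ties are broken by node name for determinism.
    hubs = sorted(adjacency, key=lambda n: (-len(adjacency[n]), n))[:k]
    kept = set(hubs)
    for hub in hubs:
        kept |= adjacency[hub]  # add the direct neighbors of each hub
    sub_edges = [(h, t) for h, t in edges if h in kept and t in kept]
    return hubs, kept, sub_edges
```

The community label of each kept node, as computed by Louvain on the full graph, can then be used to color the nodes in the visualization.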
4.2 Knowledge Graph Embedding
Using the TorchKGE framework (https://pypi.org/project/torchkge/), we embedded the nodes (entities) of the humans dataset.
The model used is TransH with the hyperparameters recommended in the original paper [2] for the FB15k dataset. Attributes are not included in the process, and nodes with fewer than 5 neighbors are filtered out. The facts are then randomly split into training (0.8) and testing (0.2) sets. Training was done on an Nvidia Titan V GPU for 1,000 epochs.
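The preprocessing described above (degree filter, then random split) can be sketched as follows. This is an illustration rather than the code used for the experiment: the function names, the single-pass filter and the fixed random seed are assumptions, and facts are represented as plain (head, relation, tail) tuples.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple

Fact = Tuple[Hashable, Hashable, Hashable]  # (head, relation, tail)


def filter_by_neighbors(facts: List[Fact], min_neighbors: int = 5) -> List[Fact]:
    """Keep the facts whose two ends both have at least `min_neighbors` neighbors.

    Neighbors are counted once, on the unfiltered graph (an assumption);
    the filter is not iterated until a fixed point is reached.
    """
    neighbors: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for head, _, tail in facts:
        neighbors[head].add(tail)
        neighbors[tail].add(head)
    kept = {n for n, nb in neighbors.items() if len(nb) >= min_neighbors}
    return [f for f in facts if f[0] in kept and f[2] in kept]


def train_test_split(
    facts: List[Fact], train_share: float = 0.8, seed: int = 0
) -> Tuple[List[Fact], List[Fact]]:
    """Shuffle the facts and cut them into training and testing sets."""
    shuffled = list(facts)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(train_share * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Note that a purely random split may leave some entities or relations of the testing set unseen during training; whether and how this was handled in the experiment is not specified here.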
The evaluation follows the protocol presented in [1]. Table 10 presents the results on a link prediction task.
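As a reminder of what this protocol measures, the sketch below computes Hit@k and mean rank for any scoring function, in both the raw and the filtered settings of [1]. It is a naive, unvectorized illustration and not the TorchKGE implementation: the function names are hypothetical, the score is assumed to be a dissimilarity (lower is better, as in translation models), and ties are resolved optimistically.

```python
from typing import Callable, Collection, Iterable, List, Tuple

Fact = Tuple[int, int, int]  # (head, relation, tail)
Score = Callable[[int, int, int], float]  # dissimilarity: lower is better


def rank_true_entity(
    score: Score,
    fact: Fact,
    entities: Iterable[int],
    side: str,
    known_facts: Collection[Fact] = frozenset(),
) -> int:
    """Rank of the true entity among all replacements of the head or the tail.

    In the filtered setting, corrupted triplets that are known to be true
    (`known_facts`) are skipped so that they cannot outrank the test fact.
    """
    head, rel, tail = fact
    true_score = score(head, rel, tail)
    better = 0
    for e in entities:
        corrupted = (e, rel, tail) if side == "head" else (head, rel, e)
        if corrupted == fact or corrupted in known_facts:
            continue
        if score(*corrupted) < true_score:
            better += 1
    return better + 1


def link_prediction_metrics(
    score: Score,
    test_facts: Iterable[Fact],
    entities: Collection[int],
    known_facts: Collection[Fact] = frozenset(),
    k: int = 10,
) -> Tuple[float, float]:
    """Return (Hit@k, mean rank) averaged over head and tail predictions."""
    ranks: List[int] = []
    for fact in test_facts:
        for side in ("head", "tail"):
            ranks.append(rank_true_entity(score, fact, entities, side, known_facts))
    hit_at_k = sum(r <= k for r in ranks) / len(ranks)
    mean_rank = sum(ranks) / len(ranks)
    return hit_at_k, mean_rank
```

Calling `link_prediction_metrics` with an empty `known_facts` gives the raw metrics; passing the union of the training and testing facts gives the filtered ones.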
Dataset | topic | Wikidata ID | sub-class examples |
---|---|---|---|
animals | taxon | Q16521 | reptilia classifications, amphibia classifications |
companies | business | Q4830453 | low-cost airline, fast food chain |
countries | country | Q6256 | arab caliphate, colonial empire |
films | film | Q11424 | cartoon, apocalyptic movie |
humans | human | Q5 | human being, child, patient |
Dataset | # nodes | # edges | | | | | |
---|---|---|---|---|---|---|---|
animals | 2,617,023 | 2,747,853 | 19,574 | 164,477 | 563,7811 | 45 | 122 |
companies | 249,619 | 50,475 | 216,421 | 94,526 | 814,068 | 101 | 339 |
countries | 3,324 | 10,198 | 1,778 | 17,965 | 37,096 | 48 | 145 |
films | 281,988 | 9,221 | 274,062 | 236,187 | 3,104,681 | 44 | 236 |
humans | 5,043,535 | 1,059,144 | 4,538,119 | 774,494 | 34,848,421 | 165 | 439 |
Dataset | Example of node | Example of edge | Example of attribute |
---|---|---|---|
animals | Tiger | | |
companies | Deutsche Telekom | | |
countries | France | | |
films | Fast Five | | |
humans | George Washington | | |
File | Description |
---|---|
readme.txt | Contains meta-data about the dataset |
attributes.txt | List of facts linking nodes to their attributes (in the form from, to, rel). |
edges.txt | List of facts linking nodes to each other (in the form from, to, rel). |
entities.txt | Dictionary linking entities to their WikiData codes and labels. |
nodes.txt | Subset of entities.txt containing only nodes of the graph |
relations.txt | Dictionary linking relations to their WikiData codes and labels. |
Relation label | # edges |
---|---|
parent taxon | 2604202 |
basionym | 97381 |
taxonomic type | 22949 |
original combination | 9077 |
taxon synonym | 6967 |
different from | 2578 |
this zoological name is coordinate with | 1885 |
host | 994 |
parent of this hybrid, breed, or cultivar | 470 |
replaced synonym (for nom. nov.) | 367 |
subclass of | 179 |
said to be the same as | 162 |
afflicts | 129 |
instance of | 88 |
based on | 83 |
named after | 53 |
has part | 53 |
derivative work | 42 |
main food source | 33 |
follows | 17 |
followed by | 17 |
Relation label | # edges |
---|---|
parent organization | 11850 |
owned by | 10783 |
subsidiary | 8574 |
owner of | 4669 |
instance of | 2840 |
followed by | 1726 |
follows | 1637 |
replaced by | 1300 |
replaces | 1053 |
operator | 993 |
business division | 569 |
member of | 565 |
different from | 557 |
founded by | 536 |
stock exchange | 528 |
part of | 444 |
architect | 237 |
has part | 190 |
manufacturer | 146 |
distributor | 144 |
industry | 114 |
Relation label | # edges |
---|---|
diplomatic relation | 6009 |
shares border with | 1202 |
country | 1090 |
replaced by | 362 |
replaces | 323 |
follows | 263 |
followed by | 263 |
located in the administrative territorial entity | 157 |
has part | 106 |
part of | 89 |
different from | 74 |
contains administrative territorial entity | 60 |
capital | 24 |
capital of | 19 |
territory claimed by | 18 |
separated from | 14 |
instance of | 11 |
facet of | 11 |
named after | 10 |
headquarters location | 10 |
founded by | 9 |
Relation label | # edges |
---|---|
follows | 3070 |
followed by | 3049 |
different from | 910 |
based on | 703 |
has part | 399 |
part of | 348 |
derivative work | 271 |
said to be the same as | 138 |
part of the series | 108 |
cast member | 59 |
inspired by | 29 |
main subject | 17 |
producer | 12 |
has edition | 10 |
edition or translation of | 10 |
genre | 8 |
production company | 7 |
influenced by | 7 |
award received | 6 |
distributor | 5 |
named after | 5 |
Relation label | # edges |
---|---|
child | 255885 |
sibling | 241225 |
father | 212661 |
spouse | 123117 |
mother | 44035 |
student of | 29931 |
different from | 27395 |
student | 25578 |
doctoral advisor | 20315 |
relative | 19844 |
doctoral student | 17630 |
consecrator | 11119 |
influenced by | 6936 |
professional or sports partner | 6230 |
partner | 6197 |
head coach | 4248 |
employer | 1309 |
said to be the same as | 670 |
killed by | 477 |
sponsor | 474 |
replaces | 347 |

Dataset | # entities | # facts | Hit@10 | Filt. Hit@10 | Mean Rank | Filt. Mean Rank |
---|---|---|---|---|---|---|
Humans | 238,376 | 722,993 | 0.515 | 0.618 | 15,030 | 15,027 |
Table 10: Link prediction results of TransH on the humans dataset with hyperparameters from [2]. The dataset was filtered to keep only entities involved in more than 5 facts.
References
- [1] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating Embeddings for Modeling Multi-relational Data. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2787–2795. Curran Associates, Inc., 2013.
- [2] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge Graph Embedding by Translating on Hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence, June 2014.
- [3] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. Knowledge Graph Embedding via Dynamic Mapping Matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 687–696, Beijing, China, July 2015. Association for Computational Linguistics.
- [4] Théo Trouillon and Maximilian Nickel. Complex and Holographic Embeddings of Knowledge Graphs: A Comparison. arXiv:1707.01475 [cs, stat], July 2017.
- [5] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d Knowledge Graph Embeddings. arXiv:1707.01476 [cs], July 2017.
- [6] Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Phung. A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 327–333, 2018. arXiv:1712.02121.
- [7] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. PyTorch-BigGraph: A Large-scale Graph Embedding System. arXiv:1903.12287 [cs, stat], March 2019.
- [8] Natalia Ostapuk, Jie Yang, and Philippe Cudre-Mauroux. ActiveLink: Deep Active Learning for Link Prediction in Knowledge Graphs. In The World Wide Web Conference, WWW ’19, pages 1398–1408, New York, NY, USA, 2019. ACM.
- [9] Saiping Guan, Xiaolong Jin, Yuanzhuo Wang, and Xueqi Cheng. Link Prediction on N-ary Relational Data. In The World Wide Web Conference, WWW ’19, pages 583–593, New York, NY, USA, 2019. ACM.
- [10] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, October 2008.