WikiDataSets: Standardized sub-graphs from WikiData

by Armand Boschin et al.
Télécom Paris

Developing new ideas and algorithms in the fields of graph processing and relational learning requires datasets to work with, and WikiData is the largest open-source knowledge graph, involving more than fifty million entities. It is larger than needed in many cases, and even too large to be processed easily, but it remains a goldmine of relevant facts and subgraphs. Using this graph is time-consuming and prone to task-specific tuning, which can affect the reproducibility of results. Providing a unified framework to extract topic-specific subgraphs solves this problem and allows researchers to evaluate algorithms on common datasets. This paper presents various topic-specific subgraphs of WikiData along with the generic Python code used to extract them. These datasets can help develop new methods of knowledge graph processing and relational learning.





1 Motivation

Relational learning has been a hot topic for a couple of years. The most widespread datasets used for benchmarking new methods ([1], [2], [3], [4], [5], [6]) are a subset of WordNet (WN18) and two subsets of Freebase (FB15k, FB1M). Recently, WikiData has been used in the literature, as a whole in [7] or as subsets (about people and films) in [8, 9]. It is a diverse and high-quality dataset that could be useful for many researchers. In order to make it easier to use, we decided to build thematic subsets, which are publicly available. They are presented in this paper.

2 Overview of the datasets

A knowledge graph is given by a set of nodes (entities) and a set of facts linking those nodes. A fact is a triplet of the form (head, relation, tail), linking two nodes by a typed edge (relation). WikiData is a large knowledge graph, archives of which can be downloaded.

We propose five topic-related datasets, which are subgraphs of the WikiData knowledge graph: animal species, companies, countries, films and humans. Each graph includes only nodes that are instances of its topic. For example, the dataset humans contains the node George Washington because the fact (George Washington, isInstanceOf, human) is true in WikiData.

More precisely, to be included, nodes should be instances of the topic or of any sub-class of the topic. For example, countries contains the USSR even though the USSR is not an instance of country; it is, however, an instance of historical country, which is a subclass of country. Topics and corresponding sub-classes are reported in Table 1.
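The selection rule above (keep entities that are instances of the topic or of any of its transitive sub-classes) can be sketched in plain Python. The helper names and the dictionary-based input format below are ours, for illustration only; the real package works directly on WikiData identifiers.

```python
from collections import deque

def collect_subclasses(topic, subclass_of):
    """Return the topic class and all of its transitive sub-classes.

    `subclass_of` maps each class to the list of its direct super-classes
    (the direction of WikiData's P279 property).
    """
    # Invert the mapping: super-class -> direct sub-classes.
    children = {}
    for sub, supers in subclass_of.items():
        for sup in supers:
            children.setdefault(sup, set()).add(sub)

    # Breadth-first traversal down the sub-class hierarchy.
    seen = {topic}
    queue = deque([topic])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def select_nodes(instance_of, topic, subclass_of):
    """Keep entities that are instances of the topic or of any sub-class of it."""
    classes = collect_subclasses(topic, subclass_of)
    return {e for e, cls in instance_of.items() if classes & set(cls)}
```

With this rule, the USSR is selected for the countries dataset through the chain USSR → historical country → country, even though it is not a direct instance of country.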

Each graph includes as edges only the relations that hold true in WikiData. For example, the dataset humans contains the nodes George Washington and Martha Washington, and there is an edge between the two because the fact (George Washington, spouse, Martha Washington) is true.

Finally, WikiData facts linking the selected nodes to WikiData entities that were not selected (because they are not instances of the topic) are also kept, as attributes of the nodes. Note that since all the edges are labeled, each graph is a knowledge graph itself.
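The split between graph edges and node attributes described above amounts to a simple partition of the facts. A minimal sketch, assuming facts are (head, relation, tail) triplets and `nodes` is the set of selected entities:

```python
def split_facts(facts, nodes):
    """Partition (head, relation, tail) facts into graph edges and attributes.

    Edges link two selected nodes; facts whose tail falls outside the
    selection are kept as attributes of the (selected) head node.
    """
    edges, attributes = [], []
    for head, relation, tail in facts:
        if head in nodes and tail in nodes:
            edges.append((head, relation, tail))
        elif head in nodes:
            attributes.append((head, relation, tail))
    return edges, attributes
```

For instance, (George Washington, spouse, Martha Washington) is an edge of the humans graph, while a fact pointing from George Washington to an entity that is not a human is kept as an attribute.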

For each dataset, we provide some metadata in Table 2 along with a couple of examples of nodes, edges and attributes in Table 3, the distributions of the edge types in Tables 5, 6, 7, 8, 9 and details of the files in Table 4.

3 Presentation of the code

The proposed datasets were built using the WikiDataSets package. Three steps are necessary.

  1. Call get_subclasses on the WikiData ID of the topic entity (e.g. Q5 for humans) to fetch the various WikiData entities that are sub-classes of the topic entity (e.g. sub-classes of human).

  2. Call query_wikidata_dump to read each line of the WikiData archive dump latest-all.json.bz2 and keep only the lines corresponding to selected nodes. This returns a list of facts stored as pickle files. Labels of the entities and relations are also collected along the way in order to provide the labels of the attributes. These labels are collected in English when available.

  3. Finally, build_dataset turns this list of facts into the five files presented in Table 4 (facts between nodes, attributes, an entity dictionary giving each entity's label and WikiData ID, and a relation dictionary giving each relation's label and WikiData ID).
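Step 2 can be sketched as a streaming filter over the compressed dump. In the dump, each line (apart from the wrapping brackets) is one JSON entity document; the claim parsing below is deliberately simplified (real claims can lack a `datavalue`, carry literals instead of entity references, etc.), and the function name is ours, not the package's.

```python
import bz2
import json

def query_dump(path, keep_ids):
    """Stream a WikiData-style JSON dump and keep facts about selected nodes.

    `path` points to a bz2-compressed file in which each line is one entity
    document; `keep_ids` is the set of selected entity IDs. Returns
    (head, property, tail) triplets. Simplified for illustration.
    """
    facts = []
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.rstrip().rstrip(",")
            if not line.startswith("{"):
                continue  # skip the '[' / ']' wrapper lines
            entity = json.loads(line)
            if entity.get("id") not in keep_ids:
                continue
            for prop, claims in entity.get("claims", {}).items():
                for claim in claims:
                    value = claim["mainsnak"]["datavalue"]["value"]["id"]
                    facts.append((entity["id"], prop, value))
    return facts
```

Reading line by line is what makes the extraction tractable: the dump never has to be decompressed or loaded in memory as a whole.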

4 Use of the datasets

4.1 Community Detection

Using the Scikit-Network framework, we extracted communities from the humans dataset with the Louvain algorithm [10]. In order to visualize the communities, the 50 nodes of highest degree were then extracted along with their neighbors. A snapshot of the visualization is presented in Figure 1; an interactive version, built with the 3d-force-graph library, is available online.
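The degree-based extraction used for the visualization can be sketched in plain Python. The edge-list format and the function below are ours (the actual experiment used Scikit-Network for the Louvain step):

```python
from collections import Counter

def top_degree_subgraph(edges, k=50):
    """Keep the k highest-degree nodes and their neighbors.

    `edges` is a list of (head, relation, tail) triplets. Returns the kept
    node set and the edges whose both endpoints were kept.
    """
    # Count each node's degree over the whole edge list.
    degree = Counter()
    for head, _, tail in edges:
        degree[head] += 1
        degree[tail] += 1

    # Top-k nodes, then add all of their direct neighbors.
    top = {node for node, _ in degree.most_common(k)}
    keep = set(top)
    for head, _, tail in edges:
        if head in top:
            keep.add(tail)
        if tail in top:
            keep.add(head)

    sub_edges = [(h, r, t) for h, r, t in edges if h in keep and t in keep]
    return keep, sub_edges
```

On the humans graph this keeps the visualization small enough to render while preserving the densest communities.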

Navigating through the graph, we find communities that seem to make sense (e.g. American artists, Vietnamese political leaders). We also find (in pink in Figure 1) the Chinese Tang dynasty, each small ball corresponding to an emperor with his wife and children.

4.2 Knowledge Graph Embedding

Using the TorchKGE framework, we embedded the nodes (entities) of the humans dataset.

The model used is TransH, with the same meta-parameters as those recommended in the original paper [2] for the FB15k dataset. Attributes are not included in the process, and nodes with fewer than 5 neighbors are filtered out. The facts are then randomly split into training (0.8) and testing (0.2) sets. Training was done on an Nvidia Titan V GPU for 1,000 epochs.

The evaluation protocol follows the one presented in [1]. We report in Table 10 the results on a link-prediction task.

Dataset | topic | WikiData ID | sub-class examples
animals | taxon | Q16521 | reptilia classifications, amphibia classifications
companies | business | Q4830453 | low-cost airline, fast food chain
countries | country | Q6256 | arab caliphate, colonial empire
films | film | Q11424 | cartoon, apocalyptic movie
humans | human | Q5 | human being, child, patient
Table 1: Topics and examples of their sub-classes for each dataset.
Dataset | # nodes | # edges | # isolated nodes* | # distinct attributes | # attribute facts | # distinct relations | # distinct attribute relations
animals | 2,617,023 | 2,747,853 | 19,574 | 164,477 | 5,637,811 | 45 | 122
companies | 249,619 | 50,475 | 216,421 | 94,526 | 814,068 | 101 | 339
countries | 3,324 | 10,198 | 1,778 | 17,965 | 37,096 | 48 | 145
films | 281,988 | 9,221 | 274,062 | 236,187 | 3,104,681 | 44 | 236
humans | 5,043,535 | 1,059,144 | 4,538,119 | 774,494 | 34,848,421 | 165 | 439
*Nodes that are isolated from the rest of the graph but have attributes.
Table 2: Metadata of each dataset.
dataset | example of node | example of edge | example of attribute
animals | Tiger | (Tiger, parent taxon, …) | (Tiger, produced sound, …)
companies | Deutsche Telekom | (Deutsche Telekom, …, T-Mobile US) | (Deutsche Telekom, award received, Big Brother Award)
countries | France | (France, shares border with, …) | …
films | Fast Five | (Fast Five, …, Fast & Furious) | (Fast Five, cast member, Paul Walker)
humans | George Washington | (George Washington, spouse, Martha Washington) | (George Washington, cause of death, …)
Table 3: Examples of nodes, facts and attributes for each dataset.
File Description
readme.txt Contains meta-data about the dataset.
attributes.txt List of facts linking nodes to their attributes (in the form from, to, rel).
edges.txt List of facts linking nodes to each other (in the form from, to, rel).
entities.txt Dictionary linking entities to their WikiData codes and labels.
nodes.txt Subset of entities.txt containing only the nodes of the graph.
relations.txt Dictionary linking relations to their WikiData codes and labels.
Table 4: Details of the files for each dataset.
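As a rough illustration, the files of Table 4 can be loaded with a few lines of Python. The tab-separated layout, the header row and the two-column dictionaries below are assumptions about the released files, not a specification; adjust the parsing to your copy.

```python
import csv
import os

def load_dataset(path):
    """Load a WikiDataSets-style graph from its directory.

    Assumes tab-separated files with one header row, following the layout
    described in Table 4 (from, to, rel for facts; code, label for
    dictionaries). Returns (edges, attributes, entities, relations).
    """
    def read(name):
        with open(os.path.join(path, name), encoding="utf-8") as f:
            rows = list(csv.reader(f, delimiter="\t"))
        return rows[1:]  # drop the header row

    edges = [tuple(row) for row in read("edges.txt")]
    attributes = [tuple(row) for row in read("attributes.txt")]
    entities = {code: label for code, label in read("entities.txt")}
    relations = {code: label for code, label in read("relations.txt")}
    return edges, attributes, entities, relations
```

The two dictionaries let downstream code work on compact WikiData codes while still being able to print human-readable labels.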
label # edges
parent taxon 2604202
basionym 97381
taxonomic type 22949
original combination 9077
taxon synonym 6967
different from 2578
this zoological name is coordinate with 1885
host 994
parent of this hybrid, breed, or cultivar 470
replaced synonym (for nom. nov.) 367
subclass of 179
said to be the same as 162
afflicts 129
instance of 88
based on 83
named after 53
has part 53
derivative work 42
main food source 33
follows 17
followed by 17
Table 5: Distribution of the top 20 edge types in the animals dataset.
label # edges
parent organization 11850
owned by 10783
subsidiary 8574
owner of 4669
instance of 2840
followed by 1726
follows 1637
replaced by 1300
replaces 1053
operator 993
business division 569
member of 565
different from 557
founded by 536
stock exchange 528
part of 444
architect 237
has part 190
manufacturer 146
distributor 144
industry 114
Table 6: Distribution of the top 20 edge types in the companies dataset.
label # edges
diplomatic relation 6009
shares border with 1202
country 1090
replaced by 362
replaces 323
follows 263
followed by 263
located in the administrative territorial entity 157
has part 106
part of 89
different from 74
contains administrative territorial entity 60
capital 24
capital of 19
territory claimed by 18
separated from 14
instance of 11
facet of 11
named after 10
headquarters location 10
founded by 9
Table 7: Distribution of the top 20 edge types in the countries dataset.
label # edges
follows 3070
followed by 3049
different from 910
based on 703
has part 399
part of 348
derivative work 271
said to be the same as 138
part of the series 108
cast member 59
inspired by 29
main subject 17
producer 12
has edition 10
edition or translation of 10
genre 8
production company 7
influenced by 7
award received 6
distributor 5
named after 5
Table 8: Distribution of the top 20 edge types in the films dataset.
label # edges
child 255885
sibling 241225
father 212661
spouse 123117
mother 44035
student of 29931
different from 27395
student 25578
doctoral advisor 20315
relative 19844
doctoral student 17630
consecrator 11119
influenced by 6936
professional or sports partner 6230
partner 6197
head coach 4248
employer 1309
said to be the same as 670
killed by 477
sponsor 474
replaces 347
Table 9: Distribution of the top 20 edge types in the humans dataset.
Figure 1: Results of the Louvain algorithm on the humans dataset. Each node is assigned a community, and each color corresponds to a community.
Dataset | # entities | # facts | Hit@10 | Filt. Hit@10 | Mean Rank | Filt. Mean Rank
Humans | 238,376 | 722,993 | 0.515 | 0.618 | 15,030 | 15,027
Table 10: Results of the TransH model on the humans dataset with hyperparameters from [2]. The dataset was filtered to keep only entities involved in more than 5 facts.